
Unable to deploy ArgoCD with HA #11388

Open

Akinorev opened this issue Nov 21, 2022 · 37 comments · Fixed by #11862

@Akinorev

Akinorev commented Nov 21, 2022

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

ArgoCD fails to deploy correctly with HA. This happens in the namespace of the ArgoCD installation.

To Reproduce

Upgrade from 2.4.6 to 2.5.1 or 2.5.2

Expected behavior

ArgoCD is upgraded/deployed successfully

Version

2.5.2 and 2.5.1 (same issue on both versions)

Logs

ha proxy:

[ALERT]    (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:9] for proxy health_check_http_url: cannot create receiving socket (Address family not supported by protocol) for [:::8888]
[ALERT]    (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:56] for frontend ft_redis_master: cannot create receiving socket (Address family not supported by protocol) for [:::6379]
[ALERT]    (1) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.

redis ha:

21 Nov 2022 16:22:36.369 # Configuration loaded
21 Nov 2022 16:22:36.370 * monotonic clock: POSIX clock_gettime
21 Nov 2022 16:22:36.377 # Warning: Could not create server TCP listening socket ::*:6379: unable to bind socket, errno: 97
21 Nov 2022 16:22:36.378 * Running mode=standalone, port=6379.
21 Nov 2022 16:22:36.378 # Server initialized
21 Nov 2022 16:22:36.379 * Ready to accept connections

repository server:

time="2022-11-21T16:25:46Z" level=info msg="ArgoCD Repository Server is starting" built="2022-11-07T16:42:47Z" commit=148d8da7a996f6c9f4d102fdd8e688c2ff3fd8c7 port=8081 version=v2.5.2+148d8da
time="2022-11-21T16:25:46Z" level=info msg="Generating self-signed TLS certificate for this session"
time="2022-11-21T16:25:46Z" level=info msg="Initializing GnuPG keyring at /app/config/gpg/keys"
time="2022-11-21T16:25:46Z" level=info msg="gpg --no-permission-warning --logger-fd 1 --batch --gen-key /tmp/gpg-key-recipe238040569" dir= execID=9e8d3
time="2022-11-21T16:25:52Z" level=error msg="`gpg --no-permission-warning --logger-fd 1 --batch --gen-key /tmp/gpg-key-recipe238040569` failed exit status 2" execID=9e8d3
time="2022-11-21T16:25:52Z" level=info msg=Trace args="[gpg --no-permission-warning --logger-fd 1 --batch --gen-key /tmp/gpg-key-recipe238040569]" dir= operation_name="exec gpg" time_ms=6031.865355
time="2022-11-21T16:25:52Z" level=fatal msg="`gpg --no-permission-warning --logger-fd 1 --batch --gen-key /tmp/gpg-key-recipe238040569` failed exit status 2"
@Akinorev added the bug (Something isn't working) label on Nov 21, 2022
@makeittotop
Contributor

makeittotop commented Nov 22, 2022

In my tests, both a vanilla HA installation of v2.5.2 and an upgrade to it (v2.5.1 -> v2.5.2, for example) fail at the redis-ha-server StatefulSet component.

# kubectl get pods
NAME                                                READY   STATUS     RESTARTS      AGE
argocd-redis-ha-haproxy-755db98494-pnkbq            1/1     Running    0             14m
argocd-redis-ha-haproxy-755db98494-q5tmw            1/1     Running    0             14m
argocd-redis-ha-haproxy-755db98494-hjj29            1/1     Running    0             14m
argocd-redis-ha-server-0                            3/3     Running    0             14m
argocd-redis-ha-server-1                            3/3     Running    0             13m
argocd-redis-ha-haproxy-5b8f6b7fdd-7q7gh            0/1     Pending    0             3m7s
argocd-applicationset-controller-57bfc6fdb8-phstq   1/1     Running    0             3m7s
argocd-server-6f4c7b9859-dlln8                      1/1     Running    0             3m6s
argocd-notifications-controller-954b6b785-jwwg8     1/1     Running    0             3m2s
argocd-repo-server-569dc6f989-xgnnw                 1/1     Running    0             3m6s
argocd-dex-server-866c9bdd5b-rxb8x                  1/1     Running    0             3m7s
argocd-server-6f4c7b9859-twn6w                      1/1     Running    0             3m1s
argocd-application-controller-0                     1/1     Running    0             3m2s
argocd-repo-server-569dc6f989-h478x                 1/1     Running    0             2m56s
argocd-redis-ha-server-2                            0/3     Init:0/1   1 (32s ago)   2m4s


# kubectl logs argocd-redis-ha-server-2 -c config-init
Tue Nov 22 04:30:41 UTC 2022 Start...
Initializing config..
Copying default redis config..
  to '/data/conf/redis.conf'
Copying default sentinel config..
  to '/data/conf/sentinel.conf'
Identifying redis master (get-master-addr-by-name)..
  using sentinel (argocd-redis-ha), sentinel group name (argocd)
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again

Sounds a little off that the redis-ha-server component is waiting for itself...?

@acartag7

I'm having the same issue in my namespaced HA install; it seems similar to a previous problem with Redis and IPv6. After adding bind 0.0.0.0 to the sentinel and redis.conf configuration, the database starts fine, but HAProxy still shows 0 masters available, and argocd-server complains about timeouts against the database.

@ghost

ghost commented Nov 23, 2022

I'm also having a similar issue: when using ArgoCD HA v2.5.2, all argocd-redis-ha-haproxy pods go into Init:CrashLoopBackOff. I'm running on a GKE cluster, version 1.23.11-gke.300. Downgrading to ArgoCD HA v2.4.17 fixed it for me. I can provide more information about my setup if useful.

@34fathombelow
Member

If everyone could please provide a few additional details about your particular cluster setup in your comments:
Cluster type? E.g. GKE, AWS, Azure, DigitalOcean?
Which CNI are you using?
Kubernetes version?
IP family? IPv4, IPv6, dual stack, or IPv6 disabled?
Are you using a service mesh?

@otherguy

otherguy commented Nov 24, 2022

Same issue here.

  • Cluster type: GKE
  • Kubernetes version: 1.24 (v1.24.4-gke.800)
  • IP family: IPv4
  • Service mesh: no

Happening with v2.5.1 and v2.5.2

@acartag7

I had the issue in v2.5.1 and v2.5.2, and had to roll back to 2.4.6, where it is working fine.
Cluster type: TKG-based Cluster
CNI: Antrea
Kubernetes: 1.19.9
IP family: IPv6 disabled
Are you using a service mesh: no

@34fathombelow
Member

I created PR #11418; if you could, please test the HA manifest in a dev environment and provide feedback. It is based on the master branch and is not suitable for production. IPv6-only environments will not be compatible.

I will also conduct testing on my side over the next few days.

@Glutamat42

Glutamat42 commented Nov 26, 2022

  • Provider: DigitalOcean
  • Kubernetes Version: 1.24.4-do.0
  • Default settings (not sure if ipv6 is enabled, can't find an option for it)
  • No service mesh deployed (only ArgoCD deployed to cluster)

My results:

  • v2.5.2 non HA: All pods are starting
  • v2.4.17 HA: All pods are starting
  • v2.5.2 HA: Redis not starting
argocd-redis-ha-haproxy-59b5d8568b-kcvz6           0/1     Init:Error              2 (2m25s ago)   6m41s
argocd-redis-ha-haproxy-59b5d8568b-pbpjf           0/1     Init:CrashLoopBackOff   2 (17s ago)     6m41s
argocd-redis-ha-haproxy-59b5d8568b-ssnmq           0/1     Init:CrashLoopBackOff   2 (20s ago)     6m41s
argocd-redis-ha-server-0                           0/3     Init:Error              3 (2m2s ago)    6m41s

# logs argocd-redis-ha-server-0 -n argocd -c config-init
Sat Nov 26 14:20:03 UTC 2022 Start...
Initializing config..
Copying default redis config..
  to '/data/conf/redis.conf'
Copying default sentinel config..
  to '/data/conf/sentinel.conf'
Identifying redis master (get-master-addr-by-name)..
  using sentinel (argocd-redis-ha), sentinel group name (argocd)
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again

# logs argocd-redis-ha-haproxy-59b5d8568b-kcvz6 -n argocd -c config-init
Waiting for service argocd-redis-ha-announce-0 to be ready (1) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (2) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (3) ...
...

Most of the time, the status of the failing pods is Init:0/1.

@otherguy

I can confirm that this is solved with 2.5.3.

Thank you!

@Glutamat42

Can also confirm this is fixed for me with 2.5.3
Thanks :)

@acartag7

acartag7 commented Dec 1, 2022

I tried @34fathombelow's solution. Now the pods are starting, but I still have an issue with Redis:

From redis pods:

1:C 01 Dec 2022 11:07:19.788 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 01 Dec 2022 11:07:19.788 # Redis version=7.0.5, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 01 Dec 2022 11:07:19.788 # Configuration loaded
1:M 01 Dec 2022 11:07:19.789 * monotonic clock: POSIX clock_gettime
1:M 01 Dec 2022 11:07:19.792 # Warning: Could not create server TCP listening socket ::*:6379: unable to bind socket, errno: 97
1:M 01 Dec 2022 11:07:19.793 * Running mode=standalone, port=6379.
1:M 01 Dec 2022 11:07:19.793 # Server initialized
1:M 01 Dec 2022 11:07:19.794 * Ready to accept connections

The HAProxy pods start out failing but eventually come up:

[WARNING] (7) : Server bk_redis_master/R0 is DOWN, reason: Layer4 timeout, info: " at step 1 of tcp-check (connect)", check duration: 3001ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] (7) : Server bk_redis_master/R1 is DOWN, reason: Layer4 timeout, info: " at step 1 of tcp-check (connect)", check duration: 3001ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] (7) : Server bk_redis_master/R2 is DOWN, reason: Layer4 timeout, info: " at step 1 of tcp-check (connect)", check duration: 3001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] (7) : backend 'bk_redis_master' has no server available!
[WARNING] (7) : Server bk_redis_master/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 7ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] (7) : Server check_if_redis_is_master_0/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] (7) : Server check_if_redis_is_master_0/R1 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] (7) : Server check_if_redis_is_master_0/R2 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

argocd-server has the following errors all the time:

redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF
redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF
redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF
redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF
redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF

@kelly-brown

I just found this issue. I'm trying to upgrade from 2.4.17 to 2.5.5 and running into the original error. Should I just follow this issue and check back when I see it closed, or do you need some help testing/validating the fix?

Thanks!

@rumstead
Member

rumstead commented Dec 29, 2022

#5957 feels related. We also see the same issue on an IPv4 TKG cluster.

EDIT: Confirmed, adding bind 0.0.0.0 to redis and sentinel fixed the issue.
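In case anyone wants to apply that workaround by hand before a release picks it up, this is a minimal sketch of what it amounts to in the argocd-redis-ha-configmap of the stock HA manifests (key names assumed from those manifests; everything after the bind line stays as shipped):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-redis-ha-configmap
  namespace: argocd
data:
  redis.conf: |
    # force an explicit IPv4 bind instead of the default wildcard/IPv6 bind
    bind 0.0.0.0
    # ...rest of the shipped redis.conf unchanged...
  sentinel.conf: |
    # same explicit IPv4 bind for sentinel
    bind 0.0.0.0
    # ...rest of the shipped sentinel.conf unchanged...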

rumstead added a commit to rumstead/argo-cd that referenced this issue Dec 30, 2022
rumstead added a commit to rumstead/argo-cd that referenced this issue Jan 4, 2023
rumstead added a commit to rumstead/argo-cd that referenced this issue Jan 4, 2023
crenshaw-dev added a commit that referenced this issue Jan 10, 2023
…1388) (#11862)

* fix(redis): explicit bind to redis and sentinel for IPv4 clusters #11388

Signed-off-by: rumstead <[email protected]>

* fix(redis): run manifests generate

Signed-off-by: rumstead <[email protected]>

* fix(redis): run manifests generate

Signed-off-by: rumstead <[email protected]>

* Retrigger CI pipeline

Signed-off-by: rumstead <[email protected]>

Signed-off-by: rumstead <[email protected]>
Co-authored-by: Michael Crenshaw <[email protected]>
crenshaw-dev added a commit that referenced this issue Jan 10, 2023
crenshaw-dev added a commit that referenced this issue Jan 10, 2023
@FrittenToni

FrittenToni commented Jan 17, 2023

Hi @crenshaw-dev,

I just wanted to report that we're still facing the issue with version 2.5.6 and an HA setup. We just upgraded our Argo dev instance from v2.4.8 to 2.5.6 via kubectl apply -n argocd-dev -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.5.6/manifests/ha/install.yaml and now our argocd-redis-ha-server-0 pod is no longer coming up due to:

Tue Jan 17 09:05:44 UTC 2023 Start...
Initializing config..
Copying default redis config..
to '/data/conf/redis.conf'
Copying default sentinel config..
to '/data/conf/sentinel.conf'
Identifying redis master (get-master-addr-by-name)..
using sentinel (argocd-redis-ha), sentinel group name (argocd)
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Tue Jan 17 09:06:59 UTC 2023 Did not find redis master ()
Identify announce ip for this pod..
using (argocd-redis-ha-announce-0) or (argocd-redis-ha-server-0)
identified announce ()
/readonly-config/init.sh: line 239: Error: Could not resolve the announce ip for this pod.: not found
Stream closed EOF for argocd-dev/argocd-redis-ha-server-0 (config-init)

@seanmmills

I am also experiencing the same issue @FrittenToni describes above. argocd-redis-ha-server starts up fine in 2.4.19, but fails on 2.5.5, 2.5.6, and 2.5.7.

@crenshaw-dev reopened this Jan 21, 2023
emirot pushed a commit to emirot/argo-cd that referenced this issue Jan 27, 2023
@jas01

jas01 commented Feb 6, 2023

Same problem with 2.5.10 on OKD 4.12. The argocd-redis-ha-server starts up fine in 2.4.19 but fails on 2.5.10.

@otherguy

otherguy commented Feb 6, 2023

Same here. The only 2.5.x version that's working is v2.5.3+0c7de21.

@johnoct-au

johnoct-au commented Feb 13, 2023

Same here, failing on 2.5.6, 2.5.10 and 2.6.1 deployments.

@otherguy

Did someone try 2.6.2?

@jas01

jas01 commented Feb 22, 2023

Did someone try 2.6.2?

Just did, same result.

pod/argocd-redis-ha-haproxy-c85b7ffd6-kh56p             0/1     Init:CrashLoopBackOff   18 (4m59s ago)   110m
pod/argocd-redis-ha-haproxy-c85b7ffd6-lsbmj             0/1     Init:0/1                19 (5m21s ago)   110m
pod/argocd-redis-ha-haproxy-c85b7ffd6-qktcv             0/1     Init:0/1                19 (5m9s ago)    110m
pod/argocd-redis-ha-server-0                            0/3     Init:CrashLoopBackOff   20 (3m39s ago)   110m

@johnoct-au

Not sure if this was anyone else's problem, but for my specific issue I was scaling argocd-redis-ha from 3 to 5 replicas while the chart only deploys 3 argocd-redis-ha-announce services, so I had to deploy two additional ones.
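For anyone in the same spot, the extra announce Services are just per-pod Services like the three shipped ones. A rough sketch of a fourth one, with the names and the statefulset pod-name selector assumed; copy the labels and ports from your existing argocd-redis-ha-announce-0 rather than trusting this verbatim:

apiVersion: v1
kind: Service
metadata:
  name: argocd-redis-ha-announce-3
  namespace: argocd
spec:
  # announce addresses must resolve even while the pod is still initializing
  publishNotReadyAddresses: true
  selector:
    # pin this Service to exactly one pod of the StatefulSet
    statefulset.kubernetes.io/pod-name: argocd-redis-ha-server-3
  ports:
    - name: server
      port: 6379
      targetPort: 6379
    - name: sentinel
      port: 26379
      targetPort: 26379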

@rimasgo

rimasgo commented Feb 23, 2023

I noticed that this issue appeared when we upgraded our cluster to k8s version v1.23.

getent hosts cannot resolve anything in the cluster.local domain:

$ time oc exec argocd-redis-ha-server-0 -c config-init -- getent hosts argocd-redis-ha
command terminated with exit code 2

real    0m10.273s
user    0m0.121s
sys     0m0.036s

$ time oc exec argocd-application-controller-0 -- getent hosts argocd-redis-ha
172.30.122.223  argocd-redis-ha.argocd.svc.cluster.local

real    0m0.273s
user    0m0.120s
sys     0m0.040s

@rimasgo

rimasgo commented Feb 23, 2023

Seems that the network policies argocd-redis-ha-proxy-network-policy and argocd-redis-ha-server-network-policy have to be reviewed. After deleting both policies, everything started to work.

I have checked that no other network policy defines ports for DNS; only the above two have port 53 defined, which is incorrect (for OpenShift). Changing the UDP/TCP ports to 5353 brought everything back to life.

@seanmmills

Seems that the network policies argocd-redis-ha-proxy-network-policy and argocd-redis-ha-server-network-policy have to be reviewed. After deleting both policies, everything started to work.

I have checked that no other network policy defines ports for DNS; only the above two have port 53 defined, which is incorrect (for OpenShift). Changing the UDP/TCP ports to 5353 brought everything back to life.

Nice find @rimasgo! I verified this works for our deployment as well via kustomize changes against v2.6.2.

- patch: |-
    - op: replace
      path: /spec/egress/1/ports/0/port
      value: 5353
    - op: replace
      path: /spec/egress/1/ports/1/port
      value: 5353
  target:
    kind: NetworkPolicy
    name: argocd-redis-ha-proxy-network-policy

- patch: |-
    - op: replace
      path: /spec/egress/1/ports/0/port
      value: 5353
    - op: replace
      path: /spec/egress/1/ports/1/port
      value: 5353
  target:
    kind: NetworkPolicy
    name: argocd-redis-ha-server-network-policy
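If it helps anyone wiring this up, the patches above drop straight into a kustomization.yaml on top of the stock HA manifests. A rough sketch (the pinned version and the remote resource URL are only an example; older kustomize versions may need the install file vendored locally):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argocd
resources:
  # stock HA install manifest; pin whatever version you are actually deploying
  - https://raw.githubusercontent.com/argoproj/argo-cd/v2.6.2/manifests/ha/install.yaml
patches:
  - target:
      kind: NetworkPolicy
      name: argocd-redis-ha-proxy-network-policy
    patch: |-
      - op: replace
        path: /spec/egress/1/ports/0/port
        value: 5353
      - op: replace
        path: /spec/egress/1/ports/1/port
        value: 5353
  - target:
      kind: NetworkPolicy
      name: argocd-redis-ha-server-network-policy
    patch: |-
      - op: replace
        path: /spec/egress/1/ports/0/port
        value: 5353
      - op: replace
        path: /spec/egress/1/ports/1/port
        value: 5353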

schakrad pushed a commit to schakrad/argo-cd that referenced this issue Mar 14, 2023
@sc0ttes

sc0ttes commented Apr 6, 2023

2.6.7 with OKD 4.12.0 (k8s 1.25.0) doesn't seem to work for me either (using this manifest). Similar to @kilian-hu-freiheit, the redis-ha statefulset and deployment pods never spin up. It appears to be a securityContext issue to me, but even after changing a lot of the variables around the securityContext (and granting 'anyuid' to the project), it still doesn't want to boot the Redis servers/proxy.

Luckily, using 2.4.x works.

@yasargil

This fixed the problem for us when upgrading 2.4 -> 2.6.

Seems that the network policies argocd-redis-ha-proxy-network-policy and argocd-redis-ha-server-network-policy have to be reviewed. After deleting both policies, everything started to work.
I have checked that no other network policy defines ports for DNS; only the above two have port 53 defined, which is incorrect (for OpenShift). Changing the UDP/TCP ports to 5353 brought everything back to life.

Nice find @rimasgo! I verified this works for our deployment as well via kustomize changes against v2.6.2.

- patch: |-
    - op: replace
      path: /spec/egress/1/ports/0/port
      value: 5353
    - op: replace
      path: /spec/egress/1/ports/1/port
      value: 5353
  target:
    kind: NetworkPolicy
    name: argocd-redis-ha-proxy-network-policy

- patch: |-
    - op: replace
      path: /spec/egress/1/ports/0/port
      value: 5353
    - op: replace
      path: /spec/egress/1/ports/1/port
      value: 5353
  target:
    kind: NetworkPolicy
    name: argocd-redis-ha-server-network-policy

@cehoffman
Contributor

cehoffman commented Jun 22, 2023

Stopping by to add where my issue with this symptom came from.

It had to do with the Kubernetes networking setup and the HA Redis setup's assumption of IPv4 networking. My cluster was configured in dual-stack mode for IPv4 and IPv6. The IPv6 address range was listed first in the cluster specification, so it is the IP shown in places that don't show all IPs. Effectively, if a Service definition doesn't specify the IP family, it will be single-family and IPv6. This is a problem for the HA setup because it defaults to all-IPv4 bind addresses in the templated configuration files. Switching them all to IPv6, e.g. bind :: for Redis and bind [::]:8888, bind [::]:6379 in HAProxy, resolved the issue.

I suspect that changing the ipFamily in the Service definitions to IPv4 would also work.
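If someone wants to try that second route instead of editing bind addresses, it is just the standard dual-stack fields on the Service spec. A minimal sketch as a kustomize strategic-merge patch, with argocd-redis-ha assumed as the Service the clients hit (the same fields exist on the other redis-ha Services):

apiVersion: v1
kind: Service
metadata:
  name: argocd-redis-ha
spec:
  # force a single-stack IPv4 ClusterIP even on a dual-stack cluster
  ipFamilyPolicy: SingleStack
  ipFamilies:
    - IPv4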

@pre

pre commented Aug 3, 2023

Both argocd-redis-ha-server and argocd-redis-ha-haproxy were unable to start in ArgoCD 2.7.10. We were updating from 2.3.12 -> 2.7.10.

Services started after removing the NetworkPolicies argocd-redis-ha-server-network-policy and argocd-redis-ha-proxy-network-policy. I have not yet looked further into why the NetworkPolicies cause the failure, but there is something wrong with them.

redis-ha-server config-init container:

Thu Aug  3 14:51:42 UTC 2023 Start...
Initializing config..
Copying default redis config..
  to '/data/conf/redis.conf'
Copying default sentinel config..
  to '/data/conf/sentinel.conf'
Identifying redis master (get-master-addr-by-name)..
  using sentinel (argocd-redis-ha), sentinel group name (argocd)
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
  Thu Aug  3 14:52:57 UTC 2023 Did not find redis master ()
Identify announce ip for this pod..
  using (argocd-redis-ha-announce-0) or (argocd-redis-ha-server-0)
  identified announce ()
/readonly-config/init.sh: line 239: Error: Could not resolve the announce ip for this pod.: not found

haproxy config-init container:

Waiting for service argocd-redis-ha-announce-0 to be ready (1) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (2) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (3) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (4) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (5) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (6) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (7) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (8) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (9) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (10) ...
Could not resolve the announce ip for argocd-redis-ha-announce-0

@dmpe
Contributor

dmpe commented Aug 14, 2023

There are indeed 2 issues:

  • one is the network policy, as found out by @rimasgo and @seanmmills
  • another is the SCCs, at least in the context of OpenShift:

(this is potentially insecure, but it works...). With this, the HA Redis pods are running "fine".

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app.kubernetes.io/component: redis
    app.kubernetes.io/name: argocd-role-ha-haproxy
    app.kubernetes.io/part-of: argocd
  name: argocd-role-ha-haproxy
  namespace: argocd
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argocd-role-crb
  namespace: argocd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argocd-role-ha-haproxy
subjects:
- kind: ServiceAccount
  name: argocd-redis-ha-haproxy
  namespace: argocd
- kind: ServiceAccount
  name: argocd-redis-ha
  namespace: argocd

@adjain131995

This is certainly a big issue. I am running ArgoCD on EKS 1.24. In my ArgoCD module the network policies do not exist, so I have nothing to delete, and my cluster is purely IPv4, so there is no solution there either.
I am running v2.7.6, and the only thing that changed is Kubernetes going from 1.23 to 1.24.
Previously it was working fine.

@julian-waibel

julian-waibel commented Dec 7, 2023

Here is how I solved my version of this issue.
Edit: Maybe this comment is only relevant for the Helm chart version of Argo CD. However, I leave this comment here in the hope that it might be useful to somebody.

Issue

When using the argo-cd Helm chart version 5.51.6 (= Argo CD 2.9.3) from https://argoproj.github.io/argo-helm with high availability enabled through values.yaml:

redis-ha:
  enabled: true

the argocd-redis-ha-haproxy-... pods crash and throw the following errors:

[ALERT]    (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:9] for proxy health_check_http_url: cannot create receiving socket (Address family not supported by protocol) for [:::8888]
[ALERT]    (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:56] for frontend ft_redis_master: cannot create receiving socket (Address family not supported by protocol) for [:::6379]
[ALERT]    (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:77] for frontend stats: cannot create receiving socket (Address family not supported by protocol) for [:::9101]
[ALERT]    (1) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.

Cause and solution

I am running a Rancher RKE2 on-premise cluster which has IPv4/IPv6 dual-stack networking enabled. However, it looks like IPv6 was not correctly enabled or is not correctly configured for the cluster. The argo-cd Helm chart uses the redis-ha subchart (see https://github.com/argoproj/argo-helm/blob/c3c588038daa7c550bbd977c1298a1fd3f42d7c8/charts/argo-cd/Chart.yaml#L20-L23), which itself configures HAProxy to bind and consume IPv6 addresses by default; see https://github.com/DandyDeveloper/charts/blob/e12198606457c7281cd60bd1ed41bd8b0a34cd53/charts/redis-ha/values.yaml#L201C13-L203

In my case it worked to disable this setting by supplying the following values.yaml to the argo-cd Helm chart:

redis-ha:
  enabled: true
+ haproxy:
+   IPv6:
+     enabled: false

@saintmalik

This is certainly a big issue. I am running ArgoCD on EKS 1.24. In my ArgoCD module the network policies do not exist, so I have nothing to delete, and my cluster is purely IPv4, so there is no solution there either. I am running v2.7.6, and the only thing that changed is Kubernetes going from 1.23 to 1.24. Previously it was working fine.

Did you find a solution? I'm having the same issues.

@mjnovice

We see this as well with 2.7.7

@1ocate

1ocate commented Jun 3, 2024

Here is how I solved my version of this issue. Edit: Maybe this comment is only relevant for the Helm chart version of Argo CD. However, I leave this comment here in the hope that it might be useful to somebody.

Issue

When using the argo-cd Helm chart version 5.51.6 (= Argo CD 2.9.3) from https://argoproj.github.io/argo-helm with high availability enabled through values.yaml:

redis-ha:
  enabled: true

the argocd-redis-ha-haproxy-... pods crash and throw the following errors:

[ALERT]    (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:9] for proxy health_check_http_url: cannot create receiving socket (Address family not supported by protocol) for [:::8888]
[ALERT]    (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:56] for frontend ft_redis_master: cannot create receiving socket (Address family not supported by protocol) for [:::6379]
[ALERT]    (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:77] for frontend stats: cannot create receiving socket (Address family not supported by protocol) for [:::9101]
[ALERT]    (1) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.

Cause and solution

I am running a Rancher RKE2 on-premise cluster which has IPv4/IPv6 dual-stack networking enabled. However, it looks like IPv6 was not correctly enabled or is not correctly configured for the cluster. The argo-cd Helm chart uses the redis-ha subchart (see https://github.com/argoproj/argo-helm/blob/c3c588038daa7c550bbd977c1298a1fd3f42d7c8/charts/argo-cd/Chart.yaml#L20-L23), which itself configures HAProxy to bind and consume IPv6 addresses by default; see https://github.com/DandyDeveloper/charts/blob/e12198606457c7281cd60bd1ed41bd8b0a34cd53/charts/redis-ha/values.yaml#L201C13-L203

In my case it worked to disable this setting by supplying the following values.yaml to the argo-cd Helm chart:

redis-ha:
  enabled: true
+ haproxy:
+   IPv6:
+     enabled: false

It works for me.
Thank you.

@Casper-dss

We are still having issues with the HA setup. We are using v2.10.12+cb6f5ac. If we take one zone down and try to sync in ArgoCD, it gets stuck in "waiting to start". No errors are reported in any logs. This is a major issue, because we cannot do anything in our production environment without ArgoCD: we are running on hosted Kubernetes, and our only "admin" access is ArgoCD.

@ML-std

ML-std commented Jul 18, 2024

In our case, we had to restart CoreDNS and Cilium agents; after that, the HA worked properly. I hope this helps someone

@pre

pre commented Sep 27, 2024

Possibly related: without maxconn 4096, HAProxy eats up all available memory and gets OOM-killed. The pod remains in a crash loop.
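In case it saves someone time, a sketch of where that cap would go, assuming the stock HA manifests where haproxy.cfg is shipped in argocd-redis-ha-configmap (key name assumed; the rest of the generated config stays as-is):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-redis-ha-configmap
  namespace: argocd
data:
  haproxy.cfg: |
    global
      # cap concurrent connections so haproxy memory stays bounded and it is not OOM killed
      maxconn 4096
    # ...rest of the generated haproxy.cfg unchanged...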
