ip-operator (provider-services 0.2.1) won't self-recover #105
I've deployed the akash-provider with IP leasing enabled in the sandbox. IP leasing worked until I restarted the provider (all-in-one config), at which point it stopped working.
To fix it I just had to bounce the IP-operator.
Most likely the IP-Operator did not wait long enough for the metallb services (metallb-controller & metallb-speaker) to initialize, so it has been failing in a loop, logging "barrier is locked, can't service request" messages.
The IP-Operator should figure out how to self-recover.
Additional info / Reproducer
I remember seeing this problem before; it is easy to reproduce.
Most likely you just need to stop the metallb services and then bounce the IP-Operator so it falls into the "barrier is locked, can't service request" loop. You can start the metallb services back up again, but that won't help: the IP-Operator will keep running in the same broken state until it gets bounced.
We have also documented this issue in the Troubleshooting IP Leases Issues documentation section here: https://docs.akash.network/providers/build-a-cloud-provider/ip-leases-provider-enablement-optional-step/troubleshooting-ip-leases-issues
Versions
ghcr.io/akash-network/provider:0.2.1
0.22.3
Symptoms
curl -sk https://provider.sandbox-01.aksh.pw:8443/status
(curl will just hang, though probing the port using nc -vz provider 8443 succeeds)
Logs
provider
ip-operator
Workaround
Workaround is to kick the IP Operator (akash-provider does NOT have to be kicked, it will automatically recover once ip-operator recovers):
Additional ideas
Until it gets fixed in the ip-operator, we can try to leverage the livenessProbe/readinessProbe so the pod restarts when it sees the ip-operator hasn't been ready/functioning for longer than 10 minutes or so.