Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ip-operator (provider-services 0.2.1) won't self-recover #105

Open
andy108369 opened this issue Jul 6, 2023 · 1 comment
Open

ip-operator (provider-services 0.2.1) won't self-recover #105

andy108369 opened this issue Jul 6, 2023 · 1 comment
Labels
repo/provider Akash provider-services repo issues sev2

Comments

@andy108369
Copy link
Contributor

andy108369 commented Jul 6, 2023

I've deployed the akash-provider with the IP leasing enabled in the sandbox.
IP leasing was working until I've restarted the provider (all-in-one config) and it stopped working.

To fix it I just had to bounce the IP-operator.

Most likely the IP-Operator did not wait long enough for the metallb services (metallb-controller & metallb-speaker) to initialize, hence it's been failing in loop throwing barrier is locked, can't service request messages.

The IP-Operator should figure how to self-recover.

Additional info / Reproducer

I remember to see this problem before, it is easy to reproduce it.
Most likely you just need to stop the metallb services and then bounce the IP-Operator so it falls into barrier is locked, can't service request loop. You can start the metallb services back up again but that won't help, the IP-Operator will keep running in the same broken state until it gets bounced.

We also have documented this issue in the Troubleshooting IP Leases Issues documentation section here https://docs.akash.network/providers/build-a-cloud-provider/ip-leases-provider-enablement-optional-step/troubleshooting-ip-leases-issues

Versions

  • provider & ip-operator: ghcr.io/akash-network/provider:0.2.1
  • Akash node (validator & RPC) 0.22.3

Symptoms

  • provider won't reply upon curl -sk https://provider.sandbox-01.aksh.pw:8443/status (curl will just hang though probing the port using nc -vz provider 8443 succeeds);

Logs

provider

I[2023-07-06|16:49:16.337] using in cluster kube config                 cmp=provider
I[2023-07-06|16:49:16.375] using in cluster kube config                 cmp=provider
D[2023-07-06|16:49:16.382] service being found via autodetection        cmp=provider service=ip-operator
D[2023-07-06|16:49:16.382] service being found via autodetection        cmp=provider service=hostname-operator
I[2023-07-06|16:49:16.383] dns discovery success                        cmp=provider cmp=service-discovery-agent addrs="[{Target:akash-hostname-operator.akash-services.svc.cluster.local. Port:8188 Priority:0 Weight:100}]" portName=status service-name=akash-hostname-operator namespace=akash-services
D[2023-07-06|16:49:16.383] satisfying pending requests                  cmp=provider cmp=service-discovery-agent qty=1
I[2023-07-06|16:49:16.384] check result                                 cmp=provider operator=hostname status=200
I[2023-07-06|16:49:16.384] ready                                        cmp=provider cmp=waiter waitable="<*operatorclients.hostnameOperatorClient 0xc0000cc580>"
I[2023-07-06|16:49:16.384] dns discovery success                        cmp=provider cmp=service-discovery-agent addrs="[{Target:akash-ip-operator.akash-services.svc.cluster.local. Port:8086 Priority:0 Weight:100}]" portName=api service-name=akash-ip-operator namespace=akash-services
D[2023-07-06|16:49:16.384] satisfying pending requests                  cmp=provider cmp=service-discovery-agent qty=1
I[2023-07-06|16:49:16.384] starting with existing reservations          module=provider-cluster cmp=provider cmp=service cmp=inventory-service qty=1
I[2023-07-06|16:49:16.385] check result                                 cmp=provider operator=ip status=503
E[2023-07-06|16:49:16.385] not yet ready                                cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc00028f400>" error="ip operator is not yet alive"
D[2023-07-06|16:49:16.386] found existing hostname                      module=provider-cluster cmp=provider cmp=service hostname=caujtgbgst9uj14p0o5fm6vc8k.ingress.sandbox-01.aksh.pw id=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:49:16.387] found orders                                 module=bidengine-service cmp=provider count=0
I[2023-07-06|16:49:16.388] fetched provider attributes                  module=bidengine-service cmp=provider provider=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:49:18.386] check result                                 cmp=provider operator=ip status=503
E[2023-07-06|16:49:18.386] not yet ready                                cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc00028f400>" error="ip operator is not yet alive"
I[2023-07-06|16:49:20.388] check result                                 cmp=provider operator=ip status=503
E[2023-07-06|16:49:20.388] not yet ready                                cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc00028f400>" error="ip operator is not yet alive"
I[2023-07-06|16:49:22.389] check result                                 cmp=provider operator=ip status=503
E[2023-07-06|16:49:22.389] not yet ready                                cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc00028f400>" error="ip operator is not yet alive"

ip-operator

$ kubectl -n akash-services get pods
NAME                                        READY   STATUS    RESTARTS      AGE
akash-hostname-operator-86d4596d6c-gqmdj    1/1     Running   1 (58m ago)   144m
akash-inventory-operator-69464fbdff-5mlzm   1/1     Running   1 (58m ago)   141m
akash-ip-operator-5d858f8fc8-d7kbj          1/1     Running   1 (58m ago)   76m
akash-provider-0                            1/1     Running   0             2m56s

$ kubectl -n akash-services logs akash-ip-operator-5d858f8fc8-d7kbj
I[2023-07-06|16:47:07.188] using in cluster kube config                 operator=ip
D[2023-07-06|16:47:07.198] service being found via autodetection        operator=ip service=metal-lb
I[2023-07-06|16:47:07.198] using in cluster kube config                 operator=ip
I[2023-07-06|16:47:07.198] clients                                      operator=ip kube="kube client 0xc0015c9f20 ns=lease" metallb="metal LB client 0xc0015ec1e0"
I[2023-07-06|16:47:07.198] HTTP listening                               operator=ip address=0.0.0.0:8086
I[2023-07-06|16:47:07.199] ip operator start                            operator=ip
I[2023-07-06|16:47:07.199] associated provider                          operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:47:07.199] fetching existing IP passthroughs            operator=ip
E[2023-07-06|16:47:17.206] dns discovery failed                         operator=ip cmp=service-discovery-agent error="lookup _monitoring._TCP.controller.metallb-system.svc.cluster.local on 169.254.25.10:53: server misbehaving" portName=monitoring service-name=controller namespace=metallb-system
D[2023-07-06|16:47:17.206] satisfying pending requests                  operator=ip cmp=service-discovery-agent qty=1
E[2023-07-06|16:47:17.206] observation stopped                          operator=ip err="lookup _monitoring._TCP.controller.metallb-system.svc.cluster.local on 169.254.25.10:53: server misbehaving"
I[2023-07-06|16:47:17.206] associated provider                          operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:47:17.206] fetching existing IP passthroughs            operator=ip
E[2023-07-06|16:48:17.209] observation stopped                          operator=ip err="context deadline exceeded"
I[2023-07-06|16:48:17.209] associated provider                          operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:48:17.209] fetching existing IP passthroughs            operator=ip
E[2023-07-06|16:49:16.385] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:17.212] observation stopped                          operator=ip err="context deadline exceeded"
I[2023-07-06|16:49:17.212] associated provider                          operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:49:17.212] fetching existing IP passthroughs            operator=ip
E[2023-07-06|16:49:18.386] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:20.388] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:22.389] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:24.390] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:26.391] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:28.392] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:30.393] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:32.394] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:34.395] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:36.397] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:38.398] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:40.399] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:42.400] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:44.401] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:46.403] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:48.404] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:50.405] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:52.407] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:54.408] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:56.410] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:58.411] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:00.412] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:02.413] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:04.415] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:06.416] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:08.417] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:10.419] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:12.420] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:14.422] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:16.423] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:17.215] observation stopped                          operator=ip err="context deadline exceeded"
I[2023-07-06|16:50:17.215] associated provider                          operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:50:17.215] fetching existing IP passthroughs            operator=ip
E[2023-07-06|16:50:18.425] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:20.426] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:22.428] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:24.430] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:26.431] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:28.432] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:30.433] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:32.435] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:34.436] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:36.437] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:38.438] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:40.439] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:42.441] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:44.442] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:46.443] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:48.445] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:50.446] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:52.448] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:54.449] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:56.450] barrier is locked, can't service request     operator=ip path=/health

Workaround

Workaround is to kick the IP Operator (akash-provider does NOT have to be kicked, it will automatically recover once ip-operator recovers):

$ kubectl -n akash-services delete pod akash-ip-operator-5d858f8fc8-d7kbj
pod "akash-ip-operator-5d858f8fc8-d7kbj" deleted
$ kubectl -n akash-services logs akash-ip-operator-5d858f8fc8-n6llv
I[2023-07-06|17:45:45.263] using in cluster kube config                 operator=ip
D[2023-07-06|17:45:45.268] service being found via autodetection        operator=ip service=metal-lb
I[2023-07-06|17:45:45.268] using in cluster kube config                 operator=ip
I[2023-07-06|17:45:45.268] clients                                      operator=ip kube="kube client 0xc000ff5f20 ns=lease" metallb="metal LB client 0xc0010421e0"
I[2023-07-06|17:45:45.268] HTTP listening                               operator=ip address=0.0.0.0:8086
I[2023-07-06|17:45:45.268] ip operator start                            operator=ip
I[2023-07-06|17:45:45.269] associated provider                          operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|17:45:45.269] fetching existing IP passthroughs            operator=ip
I[2023-07-06|17:45:45.272] dns discovery success                        operator=ip cmp=service-discovery-agent addrs="[{Target:controller.metallb-system.svc.cluster.local. Port:7472 Priority:0 Weight:100}]" portName=monitoring service-name=controller namespace=metallb-system
D[2023-07-06|17:45:45.272] satisfying pending requests                  operator=ip cmp=service-discovery-agent qty=1
I[2023-07-06|17:45:45.274] ip address inventory                         operator=ip in-use=1 available=3
I[2023-07-06|17:45:45.274] starting observation                         operator=ip
I[2023-07-06|17:45:45.276] starting ip passthrough watch                operator=ip client=kube resourceVersion=22730
I[2023-07-06|17:45:45.277] barrier can now be passed                    operator=ip
D[2023-07-06|17:45:45.277] apply event                                  operator=ip event-type=add lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|17:45:45.277] connecting                                   operator=ip lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8 service=supermario externalPort=80
D[2023-07-06|17:45:45.277] ip passthrough has changed, applying         operator=ip lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
D[2023-07-06|17:45:45.277] Updating ip passthrough                      operator=ip lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
D[2023-07-06|17:45:45.278] metal LB config change event                 operator=ip client=metallb event-type=ADDED
D[2023-07-06|17:45:45.279] creating metal-lb service                    operator=ip client=metallb service=supermario port=8080 external-port=80 sharing-key=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h-ip-superippersist exists=true
I[2023-07-06|17:45:45.282] update complete                              operator=ip lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|17:45:45.284] ip address inventory                         operator=ip in-use=1 available=3
I[2023-07-06|17:45:45.286] ip address inventory                         operator=ip in-use=1 available=3
D[2023-07-06|17:45:47.486] apply event                                  operator=ip event-type=delete lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|17:45:47.506] ip address inventory                         operator=ip in-use=0 available=3
$ curl -sk https://provider.sandbox-01.aksh.pw:8443/status | jq -r
{
  "cluster": {
    "leases": 0,
    "inventory": {
      "available": {
        "nodes": [
          {
            "cpu": 17650,
            "memory": 22853373952,
            "storage_ephemeral": 119292733243
          }
        ],
        "storage": [
          {
            "class": "beta3",
            "size": 130054156288
          }
        ]
      }
    }
  },
  "bidengine": {
    "orders": 0
  },
  "manifest": {
    "deployments": 0
  },
  "cluster_public_hostname": "provider.sandbox-01.aksh.pw",
  "address": "akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8"
}

Additional ideas

Until it gets fixed in the ip-operator, we can try to leverage the livenessProbe/readinessProbe so the pod restarts when it sees the ip-operator hasn't been ready/functioning for longer than 10 minutes or so.

@andy108369
Copy link
Contributor Author

just for the record: this is still an issue with provider 0.5.13

planning on making a workaround until it gets solved in the go code akash-network/helm-charts#276

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
repo/provider Akash provider-services repo issues sev2
Projects
None yet
Development

No branches or pull requests

1 participant