IP Leases: the IP operator does not come online automatically after node restart #76

Closed
shimpa1 opened this issue Feb 21, 2023 · 2 comments
Labels
repo/provider Akash provider-services repo issues sev2

Comments

shimpa1 commented Feb 21, 2023

On my test provider:

  • single bare-metal node build
  • built using helm-charts
  • using a helm-based RPC node

After the worker node restarts, the RPC node is in the catching up: true state (as expected), and the provider pod waits for the RPC node to reach the catching up: false state.
Meanwhile, the IP operator pod is waiting for the provider pod.

When the RPC node catches up with the top of the chain, the provider pod starts; however, the IP operator pod does not recover.

I[2023-02-21|17:43:25.749] check result                                 cmp=provider operator=ip status=503
E[2023-02-21|17:43:25.749] not yet ready                                cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc0018bacc0>" error="ip operator is not yet alive"
I[2023-02-21|17:43:27.751] check result                                 cmp=provider operator=ip status=503
E[2023-02-21|17:43:27.751] not yet ready                                cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc0018bacc0>" error="ip operator is not yet alive"

Manually restarting the IP operator pod resolves the problem.

Perhaps implement a probe of some sort to check the status of the provider pod before starting the IP operator pod.

cheers,

Shimpa

@andy108369
Contributor

Ideally this should be handled on the provider side, so that the provider can detect when the IP operator recovers.

Until then, we can see whether a livenessProbe/readinessProbe could be leveraged, so that the pod restarts once the IP operator has not been ready/functioning (from the provider's point of view) for longer than 10 minutes or so.

@troian added the repo/provider (Akash provider-services repo issues) and sev2 labels on Mar 1, 2023
@andy108369
Contributor

Moved to #105
