kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>)) on send-manifest (on tx deployment update) #152

Open
1 task
andy108369 opened this issue Nov 22, 2023 · 9 comments
Labels: repo/provider (Akash provider-services repo issues), sev1

Comments

@andy108369

provider-services 0.4.8 (provider & client [CLI])
akash network 0.28.2

I am still seeing this error (err="kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>))") on the Hurricane provider with k8s v1.27.5 (deployed with kubespray v2.23.0) when sending a manifest to the provider with the CLI (send-manifest). It is not limited to image updates in the SDL; it also happens on env updates.

It does not happen every time, only sporadically.

Todo

  • need to find a clear reproducer;

Provider logs:

D[2023-11-22|16:45:37.103] running check                                module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud cmp=deployment-monitor attempt=1
I[2023-11-22|16:45:37.135] check result                                 module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud cmp=deployment-monitor ok=true attempt=1
I[2023-11-22|16:45:47.516] update received                              module=provider-manifest cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605 version=C761DDE12EAAD74D36ACD78EB57DFF035836BD85C162E9B1A071B38313D57BEE
D[2023-11-22|16:45:48.377] running check                                module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud cmp=deployment-monitor attempt=1
I[2023-11-22|16:45:48.403] check result                                 module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud cmp=deployment-monitor ok=true attempt=1
I[2023-11-22|16:45:54.207] manifest received                            module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605
I[2023-11-22|16:45:54.210] data received                                module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605 version=c761dde12eaad74d36acd78eb57dff035836bd85c162e9b1a071b38313d57bee
D[2023-11-22|16:45:54.210] requests valid                               module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605 num-requests=1
D[2023-11-22|16:45:54.210] publishing manifest received                 module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605 num-leases=1
D[2023-11-22|16:45:54.210] publishing manifest received for lease       module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605 lease_id=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2023-11-22|16:45:54.210] manifest received                            module=provider-cluster cmp=provider cmp=service lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
D[2023-11-22|16:45:54.211] shutting down                                module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud cmp=deployment-monitor
D[2023-11-22|16:45:54.211] shutdown complete                            module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud cmp=deployment-monitor
I[2023-11-22|16:45:54.219] hostnames withheld                           module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud cnt=0
E[2023-11-22|16:45:54.219] deploying workload                           module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud err="kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>))"
E[2023-11-22|16:45:54.219] execution error                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud state=deploy-active err="kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>))"
D[2023-11-22|16:45:54.232] purged ips                                   module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
D[2023-11-22|16:45:54.248] purged hostnames                             module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
D[2023-11-22|16:45:54.248] teardown complete                            module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
D[2023-11-22|16:45:54.248] shutting down                                module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
D[2023-11-22|16:45:54.248] waiting on dm.wg                             module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
I[2023-11-22|16:45:54.248] shutdown complete                            module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
D[2023-11-22|16:45:54.248] hostnames released                           module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
D[2023-11-22|16:45:54.248] sending manager into channel                 module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
I[2023-11-22|16:45:54.248] manager done                                 module=provider-cluster cmp=provider cmp=service lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
D[2023-11-22|16:45:54.248] unreserving capacity                         module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1
I[2023-11-22|16:45:54.248] attempting to removing reservation           module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1
I[2023-11-22|16:45:54.248] removing reservation                         module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1
I[2023-11-22|16:45:54.248] unreserve capacity complete                  module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605/1/1
@andy108369 andy108369 added repo/provider Akash provider-services repo issues awaiting-triage labels Nov 22, 2023
@andy108369 andy108369 self-assigned this Nov 22, 2023
@andy108369 andy108369 changed the title kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>)) on send-manifest kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>)) on send-manifest (on tx deployment update) Dec 6, 2023
@andy108369

Still happens with provider 0.5.4 on Akash network 0.32.2. So far I have only observed it on the Hurricane provider.

It feels like this issue is triggered when the provider is scanning through the leases (the running check / check result entries, which appear almost constantly and at a high rate in the Hurricane provider logs) and there is not enough delay between tx update deployment and send-manifest.
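To test the timing hypothesis above against real logs, the gap between the on-chain update ("update received") and the following send-manifest ("manifest received") can be measured per dseq. The following is a minimal Python sketch, not part of provider-services; the function name `update_to_manifest_gaps` is hypothetical, and it assumes the log line layout shown in this issue. The sample lines are abridged from the provider logs in the first comment.

```python
import re
from datetime import datetime

# Matches only lines carrying a deployment=<owner>/<dseq> field, i.e. the
# provider-manifest "update received" and manifest-manager "manifest received"
# entries. The provider-cluster "manifest received" line uses lease= and is skipped.
LINE_RE = re.compile(
    r"^[DIE]\[(?P<ts>\d{4}-\d{2}-\d{2}\|\d{2}:\d{2}:\d{2}\.\d{3})\]\s+"
    r"(?P<msg>update received|manifest received)\s+"
    r".*?\bdeployment=(?P<owner>[^/\s]+)/(?P<dseq>\d+)"
)


def update_to_manifest_gaps(lines):
    """Return {dseq: seconds} from 'update received' to the next 'manifest received'."""
    last_update = {}
    gaps = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d|%H:%M:%S.%f")
        dseq = m["dseq"]
        if m["msg"] == "update received":
            last_update[dseq] = ts
        elif dseq in last_update:
            gaps[dseq] = (ts - last_update.pop(dseq)).total_seconds()
    return gaps


# Abridged sample (whitespace and the version field trimmed).
sample = [
    "I[2023-11-22|16:45:47.516] update received module=provider-manifest cmp=provider "
    "deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605",
    "I[2023-11-22|16:45:54.207] manifest received module=manifest-manager cmp=provider "
    "deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/13697605",
]

for dseq, seconds in update_to_manifest_gaps(sample).items():
    print(f"dseq {dseq}: {seconds:.3f}s between tx update and send-manifest")
```

For the first incident in this issue the gap is about 6.7 seconds. Comparing gaps of failing and succeeding updates would show whether a short delay actually correlates with the error.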

@andy108369

Still happens with provider 0.6.2 on Akash network 0.36.0.

Example with dseq 17438710: kube-builder errored with ClusterParams() returned result of unexpected type (%!s(<nil>)) upon updating the SDL.

provider logs 152-hurricane.log

$ cat /tmp/152-hurricane.log | grep -Ev 'operator=ip|running check|check result|below target' | grep 17438710
I[2024-08-13|16:23:13.197] update received                              module=provider-manifest cmp=provider deployment=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710 version=7C21B33A56D24DDBDFF34960DF02751567DE89C89EEDF01D9B95A26642879BE1
I[2024-08-13|16:23:22.264] manifest received                            module=manifest-manager cmp=provider deployment=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710
I[2024-08-13|16:23:22.266] data received                                module=manifest-manager cmp=provider deployment=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710 version=7c21b33a56d24ddbdff34960df02751567de89c89eedf01d9b95a26642879be1
D[2024-08-13|16:23:22.267] requests valid                               module=manifest-manager cmp=provider deployment=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710 num-requests=1
D[2024-08-13|16:23:22.267] publishing manifest received                 module=manifest-manager cmp=provider deployment=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710 num-leases=1
D[2024-08-13|16:23:22.267] publishing manifest received for lease       module=manifest-manager cmp=provider deployment=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710 lease_id=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-08-13|16:23:22.267] manifest received                            module=provider-cluster cmp=provider cmp=service lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
D[2024-08-13|16:23:22.267] shutting down                                module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud cmp=deployment-monitor
D[2024-08-13|16:23:22.267] shutdown complete                            module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud cmp=deployment-monitor
I[2024-08-13|16:23:22.272] hostnames withheld                           module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud cnt=0
E[2024-08-13|16:23:22.272] deploying workload                           module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud err="kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>))"
E[2024-08-13|16:23:22.272] execution error                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud state=deploy-active err="kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>))"
D[2024-08-13|16:23:22.276] purged ips                                   module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
D[2024-08-13|16:23:22.297] purged hostnames                             module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
D[2024-08-13|16:23:22.297] teardown complete                            module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
D[2024-08-13|16:23:22.297] shutting down                                module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
D[2024-08-13|16:23:22.297] waiting on dm.wg                             module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
I[2024-08-13|16:23:22.297] shutdown complete                            module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
D[2024-08-13|16:23:22.297] hostnames released                           module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
D[2024-08-13|16:23:22.297] sending manager into channel                 module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
I[2024-08-13|16:23:22.297] manager done                                 module=provider-cluster cmp=provider cmp=service lease=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
D[2024-08-13|16:23:22.297] unreserving capacity                         module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1
I[2024-08-13|16:23:22.297] attempting to removing reservation           module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1
I[2024-08-13|16:23:22.297] removing reservation                         module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1
I[2024-08-13|16:23:22.297] unreserve capacity complete                  module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1qh0f0h7jlq4x5gpxghrxvps5l09y7uuvcumcyd/17438710/1/1

@andy108369

The issue is still present in provider 0.6.4.
Additional logs are stored in the node2.hurricane.akash.pub:/root/issue-152-logs directory.

@andy108369

andy108369 commented Aug 21, 2024

Spotted the same issue on Valdi provider for dseqs 17676873 and 17687779.

Complete provider logs saved under [email protected]:/root/provider-logs-issue-152 dir.

D[2024-08-21|21:22:31.786] running check                                module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud cmp=deployment-monitor attempt=1
I[2024-08-21|21:22:31.807] check result                                 module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud cmp=deployment-monitor ok=true attempt=1
I[2024-08-21|21:22:41.874] update received                              module=provider-manifest cmp=provider deployment=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873 version=1824113459BC475B447403E58AE0CBF45DB47A89C5E6E295A7F2C27FE3679D56
D[2024-08-21|21:22:43.433] running check                                module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud cmp=deployment-monitor attempt=1
I[2024-08-21|21:22:43.453] check result                                 module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud cmp=deployment-monitor ok=true attempt=1
I[2024-08-21|21:22:50.428] manifest received                            module=manifest-manager cmp=provider deployment=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873
I[2024-08-21|21:22:50.433] data received                                module=manifest-manager cmp=provider deployment=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873 version=1824113459bc475b447403e58ae0cbf45db47a89c5e6e295a7f2c27fe3679d56
D[2024-08-21|21:22:50.434] requests valid                               module=manifest-manager cmp=provider deployment=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873 num-requests=1
D[2024-08-21|21:22:50.434] publishing manifest received                 module=manifest-manager cmp=provider deployment=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873 num-leases=1
D[2024-08-21|21:22:50.434] publishing manifest received for lease       module=manifest-manager cmp=provider deployment=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873 lease_id=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8
I[2024-08-21|21:22:50.434] manifest received                            module=provider-cluster cmp=provider cmp=service lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8
D[2024-08-21|21:22:50.435] shutting down                                module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud cmp=deployment-monitor
D[2024-08-21|21:22:50.435] shutdown complete                            module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud cmp=deployment-monitor
I[2024-08-21|21:22:50.441] hostnames withheld                           module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud cnt=0
E[2024-08-21|21:22:50.441] deploying workload                           module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud err="kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>))"
E[2024-08-21|21:22:50.441] execution error                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud state=deploy-active err="kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>))"
D[2024-08-21|21:22:50.445] purged ips                                   module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud
D[2024-08-21|21:22:50.452] purged hostnames                             module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud
D[2024-08-21|21:22:50.453] teardown complete                            module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud
D[2024-08-21|21:22:50.453] shutting down                                module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud
D[2024-08-21|21:22:50.453] waiting on dm.wg                             module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud
I[2024-08-21|21:22:50.453] shutdown complete                            module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud
D[2024-08-21|21:22:50.453] hostnames released                           module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud
D[2024-08-21|21:22:50.453] sending manager into channel                 module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud
I[2024-08-21|21:22:50.453] manager done                                 module=provider-cluster cmp=provider cmp=service lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8
D[2024-08-21|21:22:50.453] unreserving capacity                         module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1
I[2024-08-21|21:22:50.453] attempting to removing reservation           module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1
I[2024-08-21|21:22:50.453] removing reservation                         module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1
I[2024-08-21|21:22:50.453] unreserve capacity complete                  module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17676873/1/1
E[2024-08-21|21:17:53.841] execution error                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8/17687779/1/1/akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8 manifest-group=dcloud state=deploy-active err="kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>))"
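When going through complete provider logs, it helps to list which leases hit the error and how often. Below is a small Python sketch for that, assuming the log format shown above; `cluster_params_failures` is a hypothetical helper name, not an existing tool. It counts only the "execution error" entries, because the preceding "deploying workload" entry repeats the same error for the same failure.

```python
import re
from collections import Counter

# One "execution error" entry per failed deploy; the lease field is
# <owner>/<dseq>/<gseq>/<oseq>/<provider>.
ERR_RE = re.compile(
    r"^E\[(?P<ts>[^\]]+)\]\s+execution error\s+.*?"
    r"\blease=(?P<owner>[^/\s]+)/(?P<dseq>\d+)/(?P<gseq>\d+)/(?P<oseq>\d+)/(?P<provider>\S+)"
    r".*ClusterParams\(\) returned result of unexpected type"
)


def cluster_params_failures(lines):
    """Count ClusterParams() failures per (dseq, provider address)."""
    counts = Counter()
    for line in lines:
        m = ERR_RE.match(line)
        if m:
            counts[(m["dseq"], m["provider"])] += 1
    return counts


# Abridged from the Valdi provider logs above (padding whitespace trimmed).
PREFIX = "module=provider-cluster cmp=provider cmp=service cmp=deployment-manager "
OWNER = "akash19jqc8tsdtzvm2zd4mcg0vx9fll4feegfduvpp8"
PROVIDER = "akash19ah5c95kq4kz2g6q5rdkdgt80kc3xycsd8plq8"
ERR = 'err="kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>))"'

sample = [
    f"E[2024-08-21|21:22:50.441] deploying workload {PREFIX}"
    f"lease={OWNER}/17676873/1/1/{PROVIDER} manifest-group=dcloud {ERR}",
    f"E[2024-08-21|21:22:50.441] execution error {PREFIX}"
    f"lease={OWNER}/17676873/1/1/{PROVIDER} manifest-group=dcloud state=deploy-active {ERR}",
    f"E[2024-08-21|21:17:53.841] execution error {PREFIX}"
    f"lease={OWNER}/17687779/1/1/{PROVIDER} manifest-group=dcloud state=deploy-active {ERR}",
]

for (dseq, provider), n in sorted(cluster_params_failures(sample).items()):
    print(f"dseq {dseq} on {provider}: {n} failure(s)")
```

Feeding it a whole log file (for example `cluster_params_failures(open(path))`, with `path` pointing at a saved provider log) gives a per-lease tally that can be compared across provider versions.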

@andy108369

andy108369 commented Aug 22, 2024

Todo: Test provider v0.6.5-rc6

Provider v0.6.5-rc6 includes some patches that attempt to fix this issue.

  • install v0.6.5-rc6 on the providers where this issue happens most often (that is, the Hurricane provider). Update: done for Hurricane on August 22, 2024
  • let v0.6.5-rc6 run there for a few weeks and see whether the issue occurs again

@andy108369

The first week has been pretty smooth with 0.6.5-rc6 on Hurricane provider! 🚀

@andy108369

@troian let's release v0.6.5-rc6? It has been running well for the past three weeks on the Hurricane provider.

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                                          IMAGE
akash-node-1-0                                                ghcr.io/akash-network/node:0.36.0
akash-provider-0                                              ghcr.io/akash-network/provider:0.6.5-rc6
operator-hostname-79fc5855bb-hk9bc                            ghcr.io/akash-network/provider:0.6.5-rc6
operator-inventory-7cdfdb65d7-msl6c                           ghcr.io/akash-network/provider:0.6.5-rc6
operator-inventory-hardware-discovery-control-01.hurricane2   ghcr.io/akash-network/provider:0.6.5-rc6
operator-inventory-hardware-discovery-worker-01.hurricane2    ghcr.io/akash-network/provider:0.6.5-rc6
operator-ip-796b49c77-k4xgh                                   ghcr.io/akash-network/provider:0.6.5-rc6

$ kubectl -n akash-services get pods -o wide
NAME                                                          READY   STATUS    RESTARTS      AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
akash-node-1-0                                                1/1     Running   1 (44d ago)   44d   10.233.73.131    worker-01.hurricane2    <none>           <none>
akash-provider-0                                              1/1     Running   2 (9d ago)    24d   10.233.73.155    worker-01.hurricane2    <none>           <none>
operator-hostname-79fc5855bb-hk9bc                            1/1     Running   0             24d   10.233.73.161    worker-01.hurricane2    <none>           <none>
operator-inventory-7cdfdb65d7-msl6c                           1/1     Running   0             24d   10.233.73.144    worker-01.hurricane2    <none>           <none>
operator-inventory-hardware-discovery-control-01.hurricane2   1/1     Running   0             24d   10.233.117.178   control-01.hurricane2   <none>           <none>
operator-inventory-hardware-discovery-worker-01.hurricane2    1/1     Running   0             24d   10.233.73.179    worker-01.hurricane2    <none>           <none>
operator-ip-796b49c77-k4xgh                                   1/1     Running   0             24d   10.233.73.181    worker-01.hurricane2    <none>           <none>

$ kubectl -n akash-services logs akash-provider-0 |grep ClusterParams
Defaulted container "provider" out of: provider, init (init)
$ kubectl -n akash-services logs akash-provider-0 --previous |grep ClusterParams
Defaulted container "provider" out of: provider, init (init)

@andy108369

Side note: I've opened a discussion about what might be a contributing factor (not the root cause) to this issue: https://github.com/orgs/akash-network/discussions/760

@andy108369

andy108369 commented Jan 9, 2025

It looks like 0.6.5-rc6 might have a bug: upon send-manifest it updates the deployment/statefulset resource but not the manifest CRD, so the workload rolls back to the manifest version when the provider pod restarts.
I'll need to double-check this tomorrow.
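The suspected drift between the manifest CRD and the running workload can be checked mechanically. The sketch below is a rough Python illustration, not provider code: `collect_images` and `image_drift` are hypothetical names, and the manifest layout in the sample is illustrative only, since the real CRD schema is not shown in this issue. It mirrors the `grep image:` approach by collecting every `image` value from the JSON documents, then compares the manifest against ReplicaSets whose desired replica count is above zero.

```python
def collect_images(doc):
    """Recursively collect every string stored under an 'image' key."""
    found = set()
    if isinstance(doc, dict):
        for key, value in doc.items():
            if key == "image" and isinstance(value, str):
                found.add(value)
            else:
                found |= collect_images(value)
    elif isinstance(doc, list):
        for item in doc:
            found |= collect_images(item)
    return found


def image_drift(manifest, replicasets):
    """Compare images in the manifest CRD with images of scaled-up ReplicaSets."""
    wanted = collect_images(manifest)
    running = set()
    for rs in replicasets:
        # spec.replicas is the DESIRED column of `kubectl get rs`.
        if rs.get("spec", {}).get("replicas", 0) > 0:
            running |= collect_images(rs)
    return {
        "only_in_manifest": wanted - running,
        "only_in_workload": running - wanted,
    }


def replicaset(name, replicas, image):
    """Build a minimal ReplicaSet-shaped dict for the example."""
    return {
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "template": {"spec": {"containers": [{"name": "mars-fe", "image": image}]}},
        },
    }


# Hypothetical state: the manifest still references v2-103 while a ReplicaSet
# for v2-104 has been scaled up. The manifest layout here is illustrative.
manifest = {"spec": {"group": {"services": [{"name": "mars-fe", "image": "marsprotocol/interface:v2-103"}]}}}
replicasets = [
    replicaset("mars-fe-5b6586c956", 1, "marsprotocol/interface:v2-103"),
    replicaset("mars-fe-578d68fb99", 1, "marsprotocol/interface:v2-104"),
]

print(image_drift(manifest, replicasets))
```

In practice the inputs could come from `kubectl -n lease get manifest $ns -o json` and from the `items` list of `kubectl -n $ns get rs -o json`. A non-empty `only_in_workload` set would mean the workload has been updated without the manifest CRD following, which is the state that would roll back on a provider pod restart.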

Update 1

  • the mars FE got rolled back again to the manifest version (image tag 103)
root@control-01:~# kubectl get pods -A | grep -i mars
qd85ugsv5rbkang9n7h76gnfbualgo4vdksbvs07hnsu6   mars-fe-5b6586c956-55pnk                                         1/1     Running     0                17h
root@control-01:~# ns=qd85ugsv5rbkang9n7h76gnfbualgo4vdksbvs07hnsu6
root@control-01:~# kubectl -n $ns get pods -o yaml | grep -w image:
      image: marsprotocol/interface:v2-103
      image: docker.io/marsprotocol/interface:v2-103
root@control-01:~# kubectl -n $ns get rs
NAME                 DESIRED   CURRENT   READY   AGE
mars-fe-578d68fb99   0         0         0       24h
mars-fe-5b6586c956   1         1         1       4d2h
root@control-01:~# kubectl -n lease get manifest $ns
NAME                                            AGE
qd85ugsv5rbkang9n7h76gnfbualgo4vdksbvs07hnsu6   4d2h
root@control-01:~# kubectl -n lease get manifest $ns -o yaml | grep image:
      image: marsprotocol/interface:v2-103
root@control-01:~# kubectl -n $ns get rs mars-fe-578d68fb99 -o yaml | grep image:
        image: marsprotocol/interface:v2-104
root@control-01:~# kubectl -n $ns get rs mars-fe-5b6586c956 -o yaml | grep image:
        image: marsprotocol/interface:v2-103
root@control-01:~# kubectl -n lease get manifest $ns -o yaml | grep image:
      image: marsprotocol/interface:v2-103
root@control-01:~# kubectl -n $ns get pods -o wide
NAME                       READY   STATUS    RESTARTS   AGE   IP              NODE                   NOMINATED NODE   READINESS GATES
mars-fe-5b6586c956-55pnk   1/1     Running   0          17h   10.233.73.250   worker-01.hurricane2   <none>           <none>
root@control-01:~# kubectl -n lease get manifest $ns -o yaml | grep image:
      image: marsprotocol/interface:v2-103
  • after sending the manifest once (without tx update)

this is actually the second send overall, but the first since the akash-provider pod restart

user@laptop:~/git/akash-deployments[https://rpc.akashnet.net:443][engineering][19677969-1-1]$ akash_send_manifest mars.osmosis.zone.yaml 
Detected provider for 19677969/1/1: akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
[{"provider":"akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk","status":"PASS"}]
root@control-01:~# kubectl -n lease get manifest $ns -o yaml | grep image:
      image: marsprotocol/interface:v2-103
root@control-01:~# kubectl -n $ns get pods -o wide
NAME                       READY   STATUS    RESTARTS   AGE   IP              NODE                   NOMINATED NODE   READINESS GATES
mars-fe-578d68fb99-lmtvb   0/1     Pending   0          10s   <none>          <none>                 <none>           <none>
mars-fe-5b6586c956-55pnk   1/1     Running   0          17h   10.233.73.250   worker-01.hurricane2   <none>           <none>
  • after sending the manifest a second time (without tx update)
user@laptop:~/git/akash-deployments[https://rpc.akashnet.net:443][engineering][19677969-1-1]$ akash_send_manifest mars.osmosis.zone.yaml 
Detected provider for 19677969/1/1: akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
[{"provider":"akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk","status":"PASS"}]
root@control-01:~# kubectl -n lease get manifest $ns -o yaml | grep image:
      image: marsprotocol/interface:v2-104
root@control-01:~# kubectl -n $ns get pods -o wide
NAME                       READY   STATUS    RESTARTS   AGE   IP              NODE                   NOMINATED NODE   READINESS GATES
mars-fe-578d68fb99-lmtvb   0/1     Pending   0          21s   <none>          <none>                 <none>           <none>
mars-fe-5b6586c956-55pnk   1/1     Running   0          17h   10.233.73.250   worker-01.hurricane2   <none>           <none>
