
provider reports excessively high amount of Allocatable cpu & ram when inventory operator hits an ERROR #192

Closed
andy108369 opened this issue Mar 12, 2024 · 13 comments · Fixed by akash-network/helm-charts#271
Labels: P0, repo/provider (Akash provider-services repo issues)

andy108369 (Contributor) commented Mar 12, 2024

akash network 0.30.0
provider 0.5.4

Observation

  1. I installed the nvdp/nvidia-device-plugin helm chart by mistake and removed it after a short time:
helm upgrade --install nvdp nvdp/nvidia-device-plugin   --namespace nvidia-device-plugin   --create-namespace   --version 0.14.5   --set runtimeClassName="nvidia"   --set deviceListStrategy=volume-mounts
  2. Sometimes the provider will report an excessively large amount of Allocatable CPU & RAM.

Reinstalling the operator-inventory appeared to help at first.
However, after some time I noticed the issue had appeared again:

PROVIDER INFO
"hostname"                    "address"
"provider.sg.lnlm.akash.pub"  "akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs"

Total/Allocatable/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"                               "gpu(t/a/u)"  "mem(t/a/u GiB)"                         "ephemeral(t/a/u GiB)"
"node1"  "64/18446744073708244/-18446744073708180"  "0/0/0"       "251.45/17179863965.33/-17179863713.88"  "395.37/395.37/0"
"node2"  "64/18446744073708440/-18446744073708376"  "0/0/0"       "251.45/17179864746.7/-17179864495.25"   "395.37/395.37/0"
"node3"  "64/18446744073708440/-18446744073708376"  "0/0/0"       "251.45/17179864745.56/-17179864494.11"  "395.37/395.37/0"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
31.5          0      126         0                 0             0             31.5

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          1661.95

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

Additionally, I noticed this error in the operator-inventory logs, but soon figured it doesn't seem to be the cause, as other providers have seen the same error in their inventory operator:

$ kubectl -n akash-services logs deployment/operator-inventory | grep -v 'MODIFIED monitoring CephCluster'
...
ERROR	watcher.registry	couldn't query pci.ids	{"error": "Get \"\": unsupported protocol scheme \"\""}
...
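
As an aside, that exact error string is what Go's net/http returns when handed an empty URL, which suggests the pci.ids URL ends up unset rather than malformed. A minimal reproduction:

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// An empty URL string reproduces the operator's log line verbatim:
	// Get "": unsupported protocol scheme ""
	_, err := http.Get("")
	fmt.Println(err)
}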

Provider logs

sg.lnlm.provider.log

Detailed info (8443/status)

sg.lnlm.provider-info-detailed.log

Additional observations

  • restarting the provider pod doesn't help to recover from this situation
  • only an operator-inventory restart seems to fix it (and the provider pod does not have to be restarted after the operator-inventory restart)
  • I've let operator-inventory run for over 16 minutes and the issue hasn't reappeared yet; I'll keep monitoring it.
andy108369 added the repo/provider and awaiting-triage labels on Mar 12, 2024
andy108369 (Contributor Author) commented Mar 13, 2024

sg.lnlm - provider after 16 hours of uptime

NAME    STATUS   ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
node1   Ready    control-plane   26d   v1.28.6   192.168.0.100   <none>        Ubuntu 22.04.4 LTS   5.15.0-97-generic   containerd://1.7.13
node2   Ready    control-plane   26d   v1.28.6   192.168.0.101   <none>        Ubuntu 22.04.4 LTS   5.15.0-97-generic   containerd://1.7.13
node3   Ready    <none>          26d   v1.28.6   192.168.0.102   <none>        Ubuntu 22.04.4 LTS   5.15.0-97-generic   containerd://1.7.13

NAME               READY   STATUS    RESTARTS   AGE
akash-provider-0   1/1     Running   0          16h

akash-node-9.0.0                0.30.0
provider-9.1.2                  0.5.4
akash-hostname-operator-9.0.5   0.5.4
akash-inventory-operator-9.0.6  0.5.4
ingress-nginx-4.10.0            1.10.0
rook-ceph-v1.13.4               v1.13.4
rook-ceph-cluster-v1.13.4       v1.13.4

PROVIDER INFO
"hostname"                    "address"
"provider.sg.lnlm.akash.pub"  "akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs"

Total/Allocatable/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"                               "gpu(t/a/u)"  "mem(t/a/u GiB)"                         "ephemeral(t/a/u GiB)"
"node1"  "64/18446744073709356/-18446744073709292"  "0/0/0"       "251.45/17179868409.45/-17179868158.01"  "395.37/395.37/0"
"node2"  "64/18446744073709400/-18446744073709336"  "0/0/0"       "251.45/17179868584.58/-17179868333.13"  "395.37/395.37/0"
"node3"  "64/18446744073709400/-18446744073709336"  "0/0/0"       "251.45/17179868583.06/-17179868331.61"  "395.37/395.37/0"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
31.5          0      126         0                 0             0             31.5

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          1661.88

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
$ kubectl -n akash-services logs operator-inventory-bb568b575-dtflg |grep -v 'MODIFIED monitoring CephCluster'
I[2024-03-12|18:07:29.163] using in cluster kube config                 cmp=provider
INFO	rook-ceph	   ADDED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
INFO	watcher.storageclasses	started
INFO	nodes.nodes	waiting for nodes to finish
INFO	grpc listening on ":8081"
INFO	watcher.config	started
INFO	rest listening on ":8080"
INFO	rook-ceph	   ADDED monitoring StorageClass	{"name": "beta3"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node2"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node3"}
INFO	nodes.node.monitor	starting	{"node": "node2"}
INFO	nodes.node.monitor	starting	{"node": "node1"}
INFO	nodes.node.monitor	starting	{"node": "node3"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node1"}
INFO	rancher	   ADDED monitoring StorageClass	{"name": "beta3"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node3"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node2"}
ERROR	nodes.node.monitor	unable to query cpu	{"error": "error trying to reach service: dial tcp 10.233.75.4:8081: connect: invalid argument"}
ERROR	nodes.node.monitor	unable to query gpu	{"error": "error trying to reach service: dial tcp 10.233.75.4:8081: connect: invalid argument"}
INFO	nodes.node.monitor	started	{"node": "node2"}
INFO	nodes.node.monitor	started	{"node": "node3"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node1"}
INFO	nodes.node.monitor	started	{"node": "node1"}
$ kubectl -n akash-services get pods  -o wide
NAME                                          READY   STATUS    RESTARTS      AGE   IP               NODE    NOMINATED NODE   READINESS GATES
akash-node-1-0                                1/1     Running   1 (12d ago)   27d   10.233.71.36     node3   <none>           <none>
akash-provider-0                              1/1     Running   0             20h   10.233.71.58     node3   <none>           <none>
operator-hostname-cdb556d74-x9kb6             1/1     Running   0             8d    10.233.102.158   node1   <none>           <none>
operator-inventory-bb568b575-dtflg            1/1     Running   0             20h   10.233.75.5      node2   <none>           <none>
operator-inventory-hardware-discovery-node1   1/1     Running   0             20h   10.233.102.143   node1   <none>           <none>
operator-inventory-hardware-discovery-node2   1/1     Running   0             20h   10.233.75.4      node2   <none>           <none>
operator-inventory-hardware-discovery-node3   1/1     Running   0             20h   10.233.71.50     node3   <none>           <none>
$ kubectl -n akash-services logs operator-inventory-hardware-discovery-node2
listening on :8081
$ 

Logs


recovered after operator-inventory restart

$ kubectl rollout restart deployment/operator-inventory -n akash-services
deployment.apps/operator-inventory restarted

$ kubectl -n akash-services get pods  -o wide
NAME                                          READY   STATUS    RESTARTS      AGE   IP               NODE    NOMINATED NODE   READINESS GATES
akash-node-1-0                                1/1     Running   1 (12d ago)   27d   10.233.71.36     node3   <none>           <none>
akash-provider-0                              1/1     Running   0             20h   10.233.71.58     node3   <none>           <none>
operator-hostname-cdb556d74-x9kb6             1/1     Running   0             8d    10.233.102.158   node1   <none>           <none>
operator-inventory-7b5cb44f6c-9w5dn           1/1     Running   0             5s    10.233.75.32     node2   <none>           <none>
operator-inventory-hardware-discovery-node1   1/1     Running   0             3s    10.233.102.187   node1   <none>           <none>
operator-inventory-hardware-discovery-node2   1/1     Running   0             3s    10.233.75.8      node2   <none>           <none>
operator-inventory-hardware-discovery-node3   1/1     Running   0             3s    10.233.71.45     node3   <none>           <none>

$ kubectl -n akash-services logs deployment/operator-inventory -f | grep -v rook
I[2024-03-13|14:34:08.008] using in cluster kube config                 cmp=provider
INFO	nodes.nodes	waiting for nodes to finish
INFO	rest listening on ":8080"
INFO	watcher.storageclasses	started
INFO	watcher.config	started
INFO	grpc listening on ":8081"
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node1"}
INFO	nodes.node.monitor	starting	{"node": "node2"}
INFO	nodes.node.monitor	starting	{"node": "node3"}
INFO	nodes.node.monitor	starting	{"node": "node1"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node2"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node3"}
INFO	rancher	   ADDED monitoring StorageClass	{"name": "beta3"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node1"}
ERROR	nodes.node.monitor	unable to query cpu	{"error": "error trying to reach service: dial tcp 10.233.102.187:8081: connect: connection refused"}
ERROR	nodes.node.monitor	unable to query gpu	{"error": "error trying to reach service: dial tcp 10.233.102.187:8081: connect: connection refused"}
INFO	nodes.node.monitor	started	{"node": "node1"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node2"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node3"}
INFO	nodes.node.monitor	started	{"node": "node3"}
INFO	nodes.node.monitor	started	{"node": "node2"}

recovered:

$ provider_info2.sh provider.sg.lnlm.akash.pub
PROVIDER INFO
"hostname"                    "address"
"provider.sg.lnlm.akash.pub"  "akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs"

Total/Allocatable/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"        "gpu(t/a/u)"  "mem(t/a/u GiB)"       "ephemeral(t/a/u GiB)"
"node1"  "64/46.53/17.47"    "0/0/0"       "251.45/193.45/57.99"  "395.37/395.37/0"
"node2"  "64/46.6/17.4"      "0/0/0"       "251.45/198.58/52.87"  "395.37/395.37/0"
"node3"  "64/45.995/18.005"  "0/0/0"       "251.45/197.06/54.39"  "395.37/395.37/0"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
31.5          0      126         0                 0             0             31.5

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          1663.24

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

andy108369 changed the title from "provider reports excessively amount of Allocatable cpu & ram when inventory operator hits an ERROR" to "provider reports excessively high amount of Allocatable cpu & ram when inventory operator hits an ERROR" on Mar 13, 2024
andy108369 (Contributor Author) commented:

clue 1

The mon.obl provider reports an excessively large GPU count for node2 (see image below):
[image]

There was a network attack on this provider earlier today, and node2 was powered off for an unknown reason.

Here is the current state:

$ kubectl -n akash-services get pods -l app.kubernetes.io/name=inventory
NAME                                           READY   STATUS    RESTARTS      AGE
operator-inventory-bb568b575-mmcjp             1/1     Running   2 (18h ago)   2d3h
operator-inventory-hardware-discovery-node1    1/1     Running   0             18h
operator-inventory-hardware-discovery-node10   1/1     Running   0             18h
operator-inventory-hardware-discovery-node11   1/1     Running   0             18h
operator-inventory-hardware-discovery-node12   1/1     Running   0             18h
operator-inventory-hardware-discovery-node13   1/1     Running   0             18h
operator-inventory-hardware-discovery-node14   1/1     Running   0             18h
operator-inventory-hardware-discovery-node15   1/1     Running   0             18h
operator-inventory-hardware-discovery-node16   1/1     Running   0             18h
operator-inventory-hardware-discovery-node2    1/1     Running   0             6h15m
operator-inventory-hardware-discovery-node3    1/1     Running   0             18h
operator-inventory-hardware-discovery-node4    1/1     Running   0             18h
operator-inventory-hardware-discovery-node5    1/1     Running   0             18h
operator-inventory-hardware-discovery-node6    1/1     Running   0             18h
operator-inventory-hardware-discovery-node7    1/1     Running   0             18h
operator-inventory-hardware-discovery-node8    1/1     Running   0             18h
operator-inventory-hardware-discovery-node9    1/1     Running   0             18h
$ kubectl -n akash-services logs deployment/operator-inventory | grep -v 'MODIFIED monitoring CephCluster'
I[2024-03-14|04:22:51.569] using in cluster kube config                 cmp=provider
INFO	nodes.nodes	waiting for nodes to finish
INFO	watcher.storageclasses	started
INFO	rest listening on ":8080"
INFO	grpc listening on ":8081"
INFO	watcher.config	started
INFO	nodes.node.monitor	starting	{"node": "node10"}
INFO	nodes.node.monitor	starting	{"node": "node1"}
INFO	nodes.node.monitor	starting	{"node": "node12"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node11"}
INFO	nodes.node.monitor	starting	{"node": "node11"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node1"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node12"}
INFO	nodes.node.monitor	starting	{"node": "node14"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node13"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node14"}
INFO	nodes.node.monitor	starting	{"node": "node13"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node10"}
INFO	nodes.node.monitor	starting	{"node": "node16"}
INFO	nodes.node.monitor	starting	{"node": "node15"}
INFO	nodes.node.monitor	starting	{"node": "node2"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node2"}
INFO	nodes.node.monitor	starting	{"node": "node3"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node3"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node16"}
INFO	nodes.node.monitor	starting	{"node": "node4"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node4"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node15"}
INFO	nodes.node.monitor	starting	{"node": "node5"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node5"}
INFO	nodes.node.monitor	starting	{"node": "node6"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node6"}
INFO	nodes.node.monitor	starting	{"node": "node7"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node7"}
INFO	nodes.node.monitor	starting	{"node": "node9"}
INFO	nodes.node.monitor	starting	{"node": "node8"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node9"}
INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node8"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node10"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node3"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node9"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node5"}
INFO	nodes.node.monitor	started	{"node": "node9"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node14"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node13"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node7"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node1"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node4"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node16"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node11"}
INFO	nodes.node.monitor	started	{"node": "node13"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node8"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node12"}
INFO	nodes.node.monitor	started	{"node": "node11"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node6"}
INFO	nodes.node.monitor	started	{"node": "node7"}
INFO	nodes.node.monitor	started	{"node": "node1"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node15"}
INFO	nodes.node.monitor	started	{"node": "node10"}
INFO	nodes.node.monitor	started	{"node": "node12"}
INFO	nodes.node.monitor	started	{"node": "node3"}
INFO	nodes.node.monitor	started	{"node": "node15"}
INFO	nodes.node.monitor	started	{"node": "node14"}
INFO	nodes.node.monitor	started	{"node": "node6"}
INFO	nodes.node.monitor	started	{"node": "node4"}
INFO	nodes.node.monitor	started	{"node": "node8"}
INFO	nodes.node.monitor	started	{"node": "node5"}
INFO	nodes.node.monitor	started	{"node": "node16"}
ERROR	watcher.registry	couldn't query inventory registry	{"error": "Get \"https://provider-configs.akash.network/devices/gpus\": read tcp 10.233.74.86:39682->172.64.80.1:443: read: connection reset by peer"}
ERROR	watcher.registry	couldn't query inventory registry	{"error": "Get \"https://provider-configs.akash.network/devices/gpus\": dial tcp: lookup provider-configs.akash.network on 169.254.25.10:53: read udp 10.233.74.86:58858->169.254.25.10:53: i/o timeout"}
ERROR	watcher.registry	couldn't query inventory registry	{"error": "Get \"https://provider-configs.akash.network/devices/gpus\": read tcp 10.233.74.86:58328->172.64.80.1:443: read: connection reset by peer"}
INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node2"}
INFO	nodes.node.monitor	started	{"node": "node2"}

After bouncing the inventory-operator it normalized:

[image]

the clue

It seems that the nvdp-nvidia-device-plugin-dgfdg pod did not have enough time to fully initialize before operator-inventory-hardware-discovery-node2 assessed the number of GPUs available on node2.
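
As a side note, one way to cross-check the operator's view is to ask the Kubernetes API directly what the device plugin currently advertises for the node. A minimal client-go sketch (my own illustration, not part of provider-services; assumes a local kubeconfig):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	node, err := cs.CoreV1().Nodes().Get(context.Background(), "node2", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// What the kubelet/device plugin currently advertises for this node.
	gpus := node.Status.Allocatable["nvidia.com/gpu"]
	fmt.Printf("node2 allocatable nvidia.com/gpu: %s\n", gpus.String())
}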

deathlessdd commented Mar 16, 2024

I'm also having some weird issues. When this happens I cannot bid for GPUs on a different node. Fixing it requires bouncing the operator-inventory.

PROVIDER INFO
"hostname"                    "address"
"provider.pcgameservers.com"  "akash17l0f3kf7gv4kmgqjmgc0ksj3em6lqgcc4kl4dg"

Total/Allocatable/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"           "gpu(t/a/u)"  "mem(t/a/u GiB)"                        "ephemeral(t/a/u GiB)"
"node1"  "8/5.88/2.12"          "0/0/0"       "7.51/5.87/1.64"                        "43.13/43.13/0"
"node2"  "48/38.575/9.425"      "4/0/4"       "115.12/66.07/49.04"                    "586.82/536.82/50"
"node3"  "128/111.025/16.975"   "0/0/0"       "143.76/96.29/47.47"                    "352.06/193.06/159"
"node4"  "128/44.145/83.855"    "1/1/0"       "52.58/17179869145.17/-17179869092.59"  "290.06/110.44/179.62"
"node5"  "8/3.825/4.175"        "2/1/1"       "52.57/36.05/16.52"                     "453.94/412.03/41.91"
"node6"  "32/18.425/13.575"     "1/0/1"       "47.8/30.14/17.66"                      "175.12/155.12/20"
"node7"  "256/132.275/123.725"  "3/1/2"       "288.16/204.36/83.8"                    "352.06/287.56/64.5"

andy108369 (Contributor Author) commented:

Narrowing the issue down based on the providers' uptime (~5 days): it appears that only providers that have or had nvdp/nvidia-device-plugin installed are experiencing this issue.

andy108369 (Contributor Author) commented:

A couple of additional observations:

  1. Whenever I reboot a worker node that has GPU resources, it will more often than not (if not always) report an excessive amount of allocatable resources unless I restart the inventory operator (kubectl rollout restart deployment/operator-inventory -n akash-services).
  2. Most of the time this error appears in the inventory operator logs, after which I see the provider report an excessive amount of allocatable CPU resources (see the watchdog sketch after this list):
ERROR	watcher.registry	couldn't query pci.ids	{"error": "Get \"\": unsupported protocol scheme \"\""}
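
As mentioned in item 2, a possible stopgap is a watchdog that polls the provider's 8443/status endpoint and flags counters that have wrapped to near 2^64. A hedged sketch (my own tooling idea, not part of provider-services; the struct follows the /status JSON layout pasted elsewhere in this thread):

package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// status mirrors the slice of the provider's /status JSON we care about.
type status struct {
	Cluster struct {
		Inventory struct {
			Available struct {
				Nodes []struct {
					Name      string `json:"name"`
					Available struct {
						CPU    uint64 `json:"cpu"`    // millicores
						Memory uint64 `json:"memory"` // bytes
					} `json:"available"`
				} `json:"nodes"`
			} `json:"available"`
		} `json:"inventory"`
	} `json:"cluster"`
}

func main() {
	// Providers serve 8443/status with a self-signed cert, hence skip-verify.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Get("https://provider.sg.lnlm.akash.pub:8443/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var s status
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, n := range s.Cluster.Inventory.Available.Nodes {
		// A sane node never has ~2^62+ millicores; anything that large is a wrap.
		if n.Available.CPU > 1<<62 || n.Available.Memory > 1<<62 {
			fmt.Printf("%s: wrapped counters detected; bounce operator-inventory\n", n.Name)
		}
	}
}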

andy108369 (Contributor Author) commented Mar 20, 2024

Now the Hurricane provider keeps reporting 18446744073709524 allocatable CPUs even after I restart the inventory-operator, which until now had usually fixed the issue temporarily.

deathlessdd commented Mar 21, 2024

A new issue after restarting worker node node5, which had akash-provider-0 and operator-inventory running on it. I have 11 GPUs total; 7 show as active, but the inventory says all 11 GPUs are used, with 0 GPUs pending. Fixed by bouncing akash-provider-0 and operator-inventory, after which the inventory started reporting correctly again.

PROVIDER INFO
"hostname"                    "address"
"provider.pcgameservers.com"  "akash17l0f3kf7gv4kmgqjmgc0ksj3em6lqgcc4kl4dg"

Total/Allocatable/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"           "gpu(t/a/u)"  "mem(t/a/u GiB)"       "ephemeral(t/a/u GiB)"
"node1"  "8/0.38/7.62"          "0/0/0"       "7.51/0.37/7.14"       "43.13/37.63/5.5"
"node2"  "48/15.575/32.425"     "4/0/4"       "115.12/40.12/74.99"   "586.82/248.06/338.76"
"node3"  "128/114.025/13.975"   "0/0/0"       "143.76/97.89/45.87"   "352.06/249.93/102.13"
"node4"  "128/108.795/19.205"   "1/0/1"       "52.58/12.05/40.52"    "290.06/97.62/192.44"
"node5"  "8/0.525/7.475"        "2/0/2"       "52.57/27.37/25.2"     "453.94/284.24/169.7"
"node6"  "32/18.425/13.575"     "1/0/1"       "47.8/30.14/17.66"     "175.12/155.12/20"
"node7"  "256/108.275/147.725"  "3/0/3"       "288.16/161.36/126.8"  "352.06/217.56/134.5"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
166.5         7      174.31      290.07            0             0             52.5

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          356.66

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
15            0      0.5         0.5               0             0             0

The inventory completely stopped working:

{"cluster":{"leases":12,"inventory":{"active":[{"cpu":4000,"gpu":4,"memory":37580963840,"storage_ephemeral":53687091200},{"cpu":1500,"gpu":0,"memory":5368709120,"storage_ephemeral":8589934592},{"cpu":1000,"gpu":0,"memory":2147483648,"storage_ephemeral":1073741824,"storage":{"beta3":1073741824}},{"cpu":2000,"gpu":0,"memory":16000000000,"storage_ephemeral":100000000000},{"cpu":128000,"gpu":0,"memory":34359738368,"storage_ephemeral":32212254720},{"cpu":4000,"gpu":0,"memory":12884901888,"storage_ephemeral":1610612736},{"cpu":1000,"gpu":0,"memory":8000000000,"storage_ephemeral":30000000000},{"cpu":2000,"gpu":2,"memory":37580963840,"storage_ephemeral":53687091200},{"cpu":4000,"gpu":0,"memory":8589934592,"storage_ephemeral":2147483648,"storage":{"beta3":10737418240}},{"cpu":12000,"gpu":1,"memory":17179869184,"storage_ephemeral":21474836480},{"cpu":5000,"gpu":0,"memory":5368709120,"storage_ephemeral":5368709120,"storage":{"beta3":42949672960}},{"cpu":2000,"gpu":0,"memory":2097741824,"storage_ephemeral":1610612736,"storage":{"beta3":1610612736}}],"available":{"nodes":[{"name":"node1","allocatable":{"cpu":8000,"gpu":0,"memory":8068288512,"storage_ephemeral":46314425473},"available":{"cpu":380,"gpu":0,"memory":400015360,"storage_ephemeral":40408845441}},{"name":"node2","allocatable":{"cpu":48000,"gpu":4,"memory":123604434944,"storage_ephemeral":630096038893},"available":{"cpu":15575,"gpu":0,"memory":43081193472,"storage_ephemeral":266350410733}},{"name":"node3","allocatable":{"cpu":128000,"gpu":0,"memory":154365534208,"storage_ephemeral":378025411573},"available":{"cpu":114025,"gpu":0,"memory":105111463936,"storage_ephemeral":268361735157}},{"name":"node4","allocatable":{"cpu":128000,"gpu":1,"memory":56455852032,"storage_ephemeral":311444659299},"available":{"cpu":108795,"gpu":0,"memory":12943570944,"storage_ephemeral":104814129251}},{"name":"node5","allocatable":{"cpu":8000,"gpu":2,"memory":56443244544,"storage_ephemeral":487414664409},"available":{"cpu":525,"gpu":0,"memory":29388990464,"storage_ephemeral":305202409689}},{"name":"node6","allocatable":{"cpu":32000,"gpu":1,"memory":51326119936,"storage_ephemeral":188036982064},"available":{"cpu":18425,"gpu":0,"memory":32362043392,"storage_ephemeral":166562145584}},{"name":"node7","allocatable":{"cpu":256000,"gpu":3,"memory":309405798400,"storage_ephemeral":378025411573},"available":{"cpu":108275,"gpu":0,"memory":173257062400,"storage_ephemeral":233607136245}}],"storage":[{"class":"beta3","size":382961909760}]}}},"bidengine":{"orders":0},"manifest":{"deployments":0},"cluster_public_hostname":"provider.pcgameservers.com","address":"akash17l0f3kf7gv4kmgqjmgc0ksj3em6lqgcc4kl4dg"}

After restarting both akash-provider-0 and operator-inventory:

PROVIDER INFO
"hostname"                    "address"
"provider.pcgameservers.com"  "akash17l0f3kf7gv4kmgqjmgc0ksj3em6lqgcc4kl4dg"

Total/Allocatable/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"           "gpu(t/a/u)"  "mem(t/a/u GiB)"      "ephemeral(t/a/u GiB)"
"node1"  "8/5.38/2.62"          "0/0/0"       "7.51/5.62/1.89"      "43.13/43.13/0"
"node2"  "48/38.575/9.425"      "4/0/4"       "115.12/66.07/49.04"  "586.82/536.82/50"
"node3"  "128/114.025/13.975"   "0/0/0"       "143.76/97.89/45.87"  "352.06/249.93/102.13"
"node4"  "128/93.795/34.205"    "1/1/0"       "52.58/7.55/45.02"    "290.06/227.62/62.44"
"node5"  "8/6.025/1.975"        "2/2/0"       "55.45/53.45/2"       "453.94/453.94/0"
"node6"  "32/18.425/13.575"     "1/0/1"       "47.8/30.14/17.66"    "175.12/155.12/20"
"node7"  "256/114.275/141.725"  "3/1/2"       "288.16/196.36/91.8"  "352.06/267.56/84.5"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
166.5         7      174.31      290.07            0             0             52.5

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          396.65

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

andy108369 (Contributor Author) commented:

Now the Hurricane provider keeps reporting 18446744073709524 allocatable CPUs even after I restart the inventory-operator, which until now had usually fixed the issue temporarily.

Fixed the Hurricane reporting. It was possibly caused by some deployments being in the Failed state.
I cleaned them up, after which the reporting looks good:

arno@x1:~$ kubectl get pods -A --sort-by='{.metadata.creationTimestamp}' -o wide --field-selector status.phase=Failed 
NAMESPACE                                       NAME                           READY   STATUS                   RESTARTS        AGE     IP       NODE                   NOMINATED NODE   READINESS GATES
hg49sq80mpk3e7q7m43asnrhe1tu9639usr0psr7fkq7m   miner-xmrig-7877f4f8d9-9txlz   0/1     Error                    1               17d     <none>   worker-01.hurricane2   <none>           <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-f7785f6c6-p6m69            0/1     ContainerStatusUnknown   2 (13d ago)     14d     <none>   worker-01.hurricane2   <none>           <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-f7785f6c6-twxrr            0/1     ContainerStatusUnknown   1 (11d ago)     11d     <none>   worker-01.hurricane2   <none>           <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-f7785f6c6-5dgbl            0/1     ContainerStatusUnknown   2 (6d22h ago)   7d16h   <none>   worker-01.hurricane2   <none>           <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-f7785f6c6-95b2k            0/1     ContainerStatusUnknown   1               5d6h    <none>   worker-01.hurricane2   <none>           <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-f7785f6c6-9nh5b            0/1     ContainerStatusUnknown   1               4d17h   <none>   worker-01.hurricane2   <none>           <none>

arno@x1:~$ kubectl -n 2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu get rs
NAME            DESIRED   CURRENT   READY   AGE
web-f7785f6c6   1         1         1       14d

arno@x1:~$ kubectl -n 2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu get pods
NAME                  READY   STATUS                   RESTARTS        AGE
web-f7785f6c6-2r2j6   0/1     Completed                0               2d11h
web-f7785f6c6-4m462   0/1     Completed                2 (5d18h ago)   6d10h
web-f7785f6c6-5dgbl   0/1     ContainerStatusUnknown   2 (6d22h ago)   7d16h
web-f7785f6c6-95b2k   0/1     ContainerStatusUnknown   1               5d6h
web-f7785f6c6-9nh5b   0/1     ContainerStatusUnknown   1               4d17h
web-f7785f6c6-dsjp8   0/1     Completed                7 (8d ago)      11d
web-f7785f6c6-fl49h   0/1     Completed                0               3d4h
web-f7785f6c6-g2sfx   0/1     Completed                2 (12d ago)     12d
web-f7785f6c6-j2prf   1/1     Running                  4 (86m ago)     2d1h
web-f7785f6c6-p6m69   0/1     ContainerStatusUnknown   2 (13d ago)     14d
web-f7785f6c6-pk98k   0/1     Completed                3 (3d13h ago)   4d5h
web-f7785f6c6-q89gg   0/1     Completed                0               8d
web-f7785f6c6-twxrr   0/1     ContainerStatusUnknown   1 (11d ago)     11d
web-f7785f6c6-z8w9f   0/1     Completed                0               2d20h

arno@x1:~$ kubectl -n hg49sq80mpk3e7q7m43asnrhe1tu9639usr0psr7fkq7m get rs
NAME                     DESIRED   CURRENT   READY   AGE
miner-xmrig-7877f4f8d9   1         1         1       21d

arno@x1:~$ kubectl -n hg49sq80mpk3e7q7m43asnrhe1tu9639usr0psr7fkq7m get pods
NAME                           READY   STATUS    RESTARTS      AGE
miner-xmrig-7877f4f8d9-8mfmt   1/1     Running   1 (86m ago)   7d15h
miner-xmrig-7877f4f8d9-9txlz   0/1     Error     1             17d

arno@x1:~$ kubectl delete pods -A --field-selector status.phase=Failed 
pod "web-f7785f6c6-5dgbl" deleted
pod "web-f7785f6c6-95b2k" deleted
pod "web-f7785f6c6-9nh5b" deleted
pod "web-f7785f6c6-p6m69" deleted
pod "web-f7785f6c6-twxrr" deleted
pod "miner-xmrig-7877f4f8d9-9txlz" deleted

arno@x1:~$ kubectl -n hg49sq80mpk3e7q7m43asnrhe1tu9639usr0psr7fkq7m get pods
NAME                           READY   STATUS    RESTARTS      AGE
miner-xmrig-7877f4f8d9-8mfmt   1/1     Running   1 (86m ago)   7d15h

arno@x1:~$ kubectl -n 2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu get pods
NAME                  READY   STATUS      RESTARTS        AGE
web-f7785f6c6-2r2j6   0/1     Completed   0               2d11h
web-f7785f6c6-4m462   0/1     Completed   2 (5d18h ago)   6d10h
web-f7785f6c6-dsjp8   0/1     Completed   7 (8d ago)      11d
web-f7785f6c6-fl49h   0/1     Completed   0               3d4h
web-f7785f6c6-g2sfx   0/1     Completed   2 (12d ago)     12d
web-f7785f6c6-j2prf   1/1     Running     4 (86m ago)     2d1h
web-f7785f6c6-pk98k   0/1     Completed   3 (3d13h ago)   4d5h
web-f7785f6c6-q89gg   0/1     Completed   0               8d
web-f7785f6c6-z8w9f   0/1     Completed   0               2d20h
$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
"hostname"                      "address"
"provider.hurricane.akash.pub"  "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"

Total/Allocatable/Used (t/a/u) per node:
"name"                   "cpu(t/a/u)"         "gpu(t/a/u)"  "mem(t/a/u GiB)"      "ephemeral(t/a/u GiB)"
"control-01.hurricane2"  "2/1.2/0.8"          "0/0/0"       "1.82/1.69/0.13"      "25.54/25.54/0"
"worker-01.hurricane2"   "102/18.795/83.205"  "1/1/0"       "196.45/102.2/94.25"  "1808.76/1548.18/260.58"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
63            0      59.73       236.68            0             0             0

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          276.44

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
1.8           0      8           10                0             0             0

andy108369 (Contributor Author) commented Mar 30, 2024

it appears to still be an issue with provider-services v0.5.9 & v0.5.11:

$ curl -s -k https://provider.sg.lnlm.akash.pub:8443/status | jq -r . | grep -C1 1844
            "available": {
              "cpu": 18446744073709520000,
              "gpu": 0,
              "memory": 18446743950448112000,
              "storage_ephemeral": 424525602114
--
            "available": {
              "cpu": 18446744073709496000,
              "gpu": 0,
              "memory": 18446743844816663000,
              "storage_ephemeral": 424525602114
--
            "available": {
              "cpu": 18446744073709537000,
              "gpu": 0,
              "memory": 18446744023303657000,
              "storage_ephemeral": 424525602114

andy108369 (Contributor Author) commented:

(pdx.nb.akash.pub 4090s provider)
It appears that this also has something to do with failing pods.
For instance: 3 nodes with 8x 4090s each.

node1 and node2 had the wrong nvidia.ko driver version installed (550 instead of 535).
I reinstalled it while these deployments were running and then restarted all 3 nodes.

This left some pods stuck in the ContainerStatusUnknown state:

$ kubectl get pods -A --sort-by='{.metadata.creationTimestamp}' -o wide --field-selector status.phase=Failed 
NAMESPACE                                       NAME                         READY   STATUS                   RESTARTS   AGE   IP       NODE    NOMINATED NODE   READINESS GATES
3n3mvl6qqh1bkk41pou3dkfttkjglo6tua36udmh6n4fm   service-1-dd746bf44-4745c    0/1     ContainerStatusUnknown   0          43m   <none>   node1   <none>           <none>
1qhlsoi0sqj2rot1otov7vhfao2j0cnmbuvkj2qd16ese   service-1-76f8b9cf6d-dp7x2   0/1     ContainerStatusUnknown   0          33m   <none>   node2   <none>           <none>
va7h4phadmnfld29qd5rdtdsgk6eupf39d6etke05t0fu   service-1-7c59bdb7df-j586f   0/1     ContainerStatusUnknown   0          32m   <none>   node2   <none>           <none>
qg9lq6q8tcta1p2m9fuc1pdbjfispht8q7e7iun6t5s2e   service-1-59974dfd89-sgvjg   0/1     Unknown                  0          32m   <none>   node2   <none>           <none>
6ng2gu6vf5p8qg5bde5udse1e34igb1bn15kaeupiuhva   service-1-59c44cd758-mzvsj   0/1     Unknown                  0          30m   <none>   node2   <none>           <none>
d62adnou0v7b5s7h3t8gnh0av540fcok9bk56u72f3je2   service-1-7cffd45f48-w25vt   0/1     ContainerStatusUnknown   0          28m   <none>   node2   <none>           <none>
of3uincqjlja5ekk8cbfuormpm4dmn8v403c535f4dc4m   service-1-7dd5dffbc4-brgvg   0/1     ContainerStatusUnknown   0          25m   <none>   node1   <none>           <none>
rfij4esvggf9cqqnpf2hq266o0nba01t5iq918bu1v9iu   service-1-58bf676fdc-ph4f9   0/1     ContainerStatusUnknown   0          23m   <none>   node2   <none>           <none>
6pgohd98lm7gs5rb2kv5bnc4c9920jtvfvmg4ikqvhn8a   service-1-55858fc545-6fr5l   0/1     ContainerStatusUnknown   0          22m   <none>   node1   <none>           <none>
guf6r2fhenpfljbhncip9sbei3ss43av4kaau95kl4rpq   service-1-598c857c89-wtkvh   0/1     ContainerStatusUnknown   0          22m   <none>   node2   <none>           <none>
tsu3ue9housp0ehjsr51psu4aambpaqvtpuninpl07hqs   service-1-55fd66f6f5-2hfqv   0/1     ContainerStatusUnknown   0          21m   <none>   node1   <none>           <none>
qrka4mab6esns6blt8jaeos663j0e6sbp9cfghi02jc4i   service-1-84c67b446-5v55h    0/1     ContainerStatusUnknown   0          20m   <none>   node1   <none>           <none>
eogo7cjtebo8fr0g9l1mfmo17j5r3efi1hitlpki3g04g   service-1-84988c5fb6-8rhtq   0/1     ContainerStatusUnknown   0          20m   <none>   node1   <none>           <none>
g5i1ml6bhnfso9faglp1gv167f8acegsv335hlkq0dlfc   service-1-7d66cdd98c-572vx   0/1     ContainerStatusUnknown   0          18m   <none>   node1   <none>           <none>

This in turn triggered the bug (see the GPU count for node1 and node2):

$ provider_info2.sh provider.pdx.nb.akash.pub
PROVIDER INFO
"hostname"                   "address"
"provider.pdx.nb.akash.pub"  "akash1t0sk5nhc8n3xply5ft60x9det0s7jwplzzycnv"

Total/Available/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"          "gpu(t/a/u)"                                    "mem(t/a/u GiB)"        "ephemeral(t/a/u GiB)"
"node1"  "128/21.38/106.62"    "0/18446744073709552000/-18446744073709552000"  "503.61/401.45/102.16"  "6385.77/5515.21/870.55"
"node2"  "128/24.12/103.88"    "8/18446744073709552000/-18446744073709552000"  "503.61/375.96/127.65"  "6385.77/4059.33/2326.43"
"node3"  "128/22.425/105.575"  "8/1/7"                                         "503.61/400.99/102.62"  "6385.77/5515.21/870.55"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
149           14     152.74      2033.77           0             0             0

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          1699.04

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

I deleted the pods stuck in ContainerStatusUnknown and the stats immediately recovered:

kubectl delete pods -A --field-selector status.phase=Failed 
$ provider_info2.sh provider.pdx.nb.akash.pub
PROVIDER INFO
"hostname"                   "address"
"provider.pdx.nb.akash.pub"  "akash1t0sk5nhc8n3xply5ft60x9det0s7jwplzzycnv"

Total/Available/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"          "gpu(t/a/u)"  "mem(t/a/u GiB)"        "ephemeral(t/a/u GiB)"
"node1"  "128/57.38/70.62"     "8/4/4"       "503.61/434.97/68.64"   "6385.77/5938.73/447.03"
"node2"  "128/73.12/54.88"     "8/1/7"       "503.61/435.56/68.05"   "6385.77/5222.55/1163.22"
"node3"  "128/22.425/105.575"  "8/1/7"       "503.61/400.99/102.62"  "6385.77/5515.21/870.55"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
181           16     182.54      2257.29           0             0             0

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          1699.04

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
16            1      14.9        111.76            0             0             0

andy108369 (Contributor Author) commented Apr 10, 2024

pdx.nb provider - issue happened in under 55 mins after operator-inventory restart

I think the pdx.nb provider is a good candidate for monitoring this issue, since it started occurring frequently after the node1.pdx.nb.akash.pub node was replaced yesterday (mainboard, GPUs & Ceph disk; everything except the main OS disks (rootfs)).

$ provider_info2.sh provider.pdx.nb.akash.pub
PROVIDER INFO
"hostname"                   "address"
"provider.pdx.nb.akash.pub"  "akash1t0sk5nhc8n3xply5ft60x9det0s7jwplzzycnv"

Total/Available/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"         "gpu(t/a/u)"  "mem(t/a/u GiB)"        "ephemeral(t/a/u GiB)"
"node1"  "128/11.65/116.35"   "8/1/7"       "503.59/240.78/262.82"  "6385.77/5603.46/782.31"
"node2"  "128/39.95/88.05"    "8/0/8"       "503.61/418.51/85.1"    "6385.77/5138.73/1247.03"
"node3"  "128/6.325/121.675"  "8/0/8"       "503.61/368.93/134.68"  "6385.77/5403.46/982.31"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
356           26     583.64      3346.93           0             0             770

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          846.13

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
$ provider_info2.sh provider.pdx.nb.akash.pub
PROVIDER INFO
"hostname"                   "address"
"provider.pdx.nb.akash.pub"  "akash1t0sk5nhc8n3xply5ft60x9det0s7jwplzzycnv"

Total/Available/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"                                "gpu(t/a/u)"                                    "mem(t/a/u GiB)"        "ephemeral(t/a/u GiB)"
"node1"  "128/18446744073709468/-18446744073709340"  "8/18446744073709552000/-18446744073709552000"  "503.59/16.78/486.82"   "6385.77/4932.9/1452.86"
"node2"  "128/39.95/88.05"                           "8/0/8"                                         "503.61/418.51/85.1"    "6385.77/5138.73/1247.03"
"node3"  "128/6.325/121.675"                         "8/0/8"                                         "503.61/368.93/134.68"  "6385.77/5403.46/982.31"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
356           26     583.64      3346.93           0             0             770

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          846.13

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
$ kubectl -n akash-services get pods 
NAME                                          READY   STATUS    RESTARTS        AGE
akash-node-1-0                                1/1     Running   2 (5d16h ago)   5d17h
akash-provider-0                              1/1     Running   0               15h
operator-hostname-574d8699d-c22w5             1/1     Running   3 (3d11h ago)   5d17h
operator-inventory-75df5b6fb5-2k897           1/1     Running   0               55m
operator-inventory-hardware-discovery-node1   1/1     Running   0               55m
operator-inventory-hardware-discovery-node2   1/1     Running   0               55m
operator-inventory-hardware-discovery-node3   1/1     Running   0               55m
$ kubectl -n akash-services logs deployment/operator-inventory --timestamps
2024-04-10T09:19:38.827373163Z I[2024-04-10|09:19:38.827] using in cluster kube config                 cmp=provider
2024-04-10T09:19:39.849250140Z INFO	rook-ceph	   ADDED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:19:39.874528936Z INFO	nodes.nodes	waiting for nodes to finish
2024-04-10T09:19:39.874554196Z INFO	grpc listening on ":8081"
2024-04-10T09:19:39.874575036Z INFO	watcher.storageclasses	started
2024-04-10T09:19:39.874577926Z INFO	watcher.config	started
2024-04-10T09:19:39.874582506Z INFO	rest listening on ":8080"
2024-04-10T09:19:39.877335127Z INFO	rook-ceph	   ADDED monitoring StorageClass	{"name": "beta3"}
2024-04-10T09:19:39.878384595Z INFO	nodes.node.monitor	starting	{"node": "node1"}
2024-04-10T09:19:39.878393235Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node1"}
2024-04-10T09:19:39.878405655Z INFO	nodes.node.monitor	starting	{"node": "node2"}
2024-04-10T09:19:39.878415995Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node2"}
2024-04-10T09:19:39.878422595Z INFO	nodes.node.monitor	starting	{"node": "node3"}
2024-04-10T09:19:39.878454075Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node3"}
2024-04-10T09:19:39.885778559Z INFO	rancher	   ADDED monitoring StorageClass	{"name": "beta3"}
2024-04-10T09:19:42.756543562Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node3"}
2024-04-10T09:19:42.916743883Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node2"}
2024-04-10T09:19:43.183058728Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node1"}
2024-04-10T09:19:43.202991245Z INFO	nodes.node.monitor	started	{"node": "node2"}
2024-04-10T09:19:43.426359532Z INFO	nodes.node.monitor	started	{"node": "node1"}
2024-04-10T09:19:44.084855728Z INFO	nodes.node.monitor	started	{"node": "node3"}
2024-04-10T09:19:49.948809750Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:20:50.587788813Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:21:51.246518697Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:22:51.918597738Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:23:52.573545877Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:24:53.223854209Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:25:53.901694220Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:26:54.559292611Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:27:55.217895603Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:28:55.871364360Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:29:56.526719703Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:30:57.202289863Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:31:57.853717274Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:32:58.516702894Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:33:59.173653862Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:34:59.835751168Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:36:00.494909031Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:37:01.155779061Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:38:01.817805371Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:39:02.475386563Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:40:03.139422647Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:41:03.806580305Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:42:04.457760854Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:43:05.106685230Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:44:05.749807485Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:45:06.406388585Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:46:07.065620189Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:47:07.734040162Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:48:08.401818233Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:49:09.064330123Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:50:09.719065483Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:51:10.362797022Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:52:11.025336455Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:53:11.682972539Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:54:12.349148024Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:55:13.014021018Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:56:13.689351710Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:57:14.351208230Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:58:15.017461998Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:59:15.672108483Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:00:16.317829774Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:01:16.969684860Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:02:17.638128924Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:03:18.284770571Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:04:18.947803687Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:05:19.593905389Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:06:20.263411687Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:07:20.903124077Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:08:21.553959836Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:09:22.205702368Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:10:22.857968005Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:11:23.532570698Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:12:24.181899391Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:13:24.839654285Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:14:25.493307725Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}

this might also be the trigger:

NAMESPACE                                       LAST SEEN   TYPE      REASON              OBJECT                                            MESSAGE
sj264h0mg6bq9alvkqtnq69ubjd9ptq4ubdfkuc9i6rdm   14m         Warning   FailedScheduling    pod/service-1-0                                   0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
akash-services                                  60m         Normal    Scheduled           pod/operator-inventory-75df5b6fb5-2k897           Successfully assigned akash-services/operator-inventory-75df5b6fb5-2k897 to node2
eg0vmr4qmf9kdumohtdahhqq14aa3i5q1dutblo0jugc2   14m         Warning   FailedScheduling    pod/service-1-0                                   0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
eip4fok1c0g4eome40s2r4u3941at4sua6rlh842c07dq   14m         Warning   FailedScheduling    pod/service-1-0                                   0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..

FWIW, `11/26` leases are using `beta3` persistent storage.
I know how to contact the owners of most of the 26 leases on the pdx.nb provider if needed.

andy108369 (Contributor Author) commented Apr 11, 2024

sg.lnlm.akash.pub - issue got triggered

Looks like this is what triggered the "excessively large stats" issue on sg.lnlm.akash.pub:

2024-04-11T08:10:21.154985407Z ERROR	watcher.registry	couldn't query pci.ids	{"error": "Get \"\": unsupported protocol scheme \"\""}

complete logs with the timestamps:
sg.lnlm.akash.pub.deployment-operator-inventory.log

Additionally

there have been neither lease-created nor lease-closed events for this provider in the past week:

  • lease-created
$ provider-services query txs --events "akash.v1.provider=akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs&akash.v1.module=market&akash.v1.action=lease-created" --page 1 --limit 100 -o json | jq -r '.txs[] | [.timestamp, .height, .txhash, .code, (.tx.body.messages[] | ."@type"), (.logs[].events[].attributes[] | (select(.key == "action") | .value), (select(.key == "dseq") | .value), (select(.key == "provider") | .value), (select(.key == "price-amount") | .value))] | @csv'
  • lease-closed
provider-services query txs --events "akash.v1.provider=akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs&akash.v1.module=market&akash.v1.action=lease-closed" --page 1 --limit 100 -o json | jq -r '.txs[] | [.timestamp, .height, .txhash, .code, (.tx.body.messages[] | ."@type"), (.logs[].events[].attributes[] | (select(.key == "action") | .value), (select(.key == "dseq") | .value), (select(.key == "provider") | .value), (select(.key == "price-amount") | .value))] | @csv'

however, there have been some bid-created & bid-closed events just today:

  • bid-created
$ provider-services query txs --events "akash.v1.provider=akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs&akash.v1.module=market&akash.v1.action=bid-created" --page 1 --limit 100 -o json | jq -r '.txs[] | [.timestamp, .height, .txhash, .code, (.tx.body.messages[] | ."@type"), (.logs[].events[].attributes[] | (select(.key == "action") | .value), (select(.key == "dseq") | .value), (select(.key == "provider") | .value), (select(.key == "price-amount") | .value))] | @csv'
...
"2024-04-05T07:27:16Z","15741103","96056FFA015683ECAD1DC2C8E19FC886B552F04712BBA43B07DF446CD3E910B8",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15741100","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.439409000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-10T08:26:10Z","15813611","84BC3A4364E8E5D6B7D843C094009FE4076A7FE9721F126B69AD7D146B1913C9",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15813608","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.426759000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-10T10:12:10Z","15814669","CD0ABD45313F9C0EE96359D6D0197E92E1CCF1359B4B146ACEAF7AEEEB009B4D",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15814666","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.430669000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T05:36:13Z","15826304","5EA500D96198A996371D2D304BA75A720D6D89C428A263BB2F4E19DA9CAD829D",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15826301","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.386220000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:45:06Z","15827589","79B2A29D9CA2CB9BB1FC9D242B7963A6F86BE318C1189ABEA0544273F807DDE6",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827587","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:46:06Z","15827599","744DBD61B2E7EF42EB6DE1D992AED30CD1A5BEADA676F21D336B9EF37AAE2E9D",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827597","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:47:07Z","15827609","1FC2EB06D72087FD033EA5090F65C945AAB88D832995F8DDC4CE5AC3710834AA",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827607","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:48:19Z","15827621","AA70E01582A99C6FE9752B28084B05A9BCEEF1B27A6006B5711581AEC0668DA5",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827617","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:49:07Z","15827629","66E5102E111D40E33C5CCA6EF5CA48F07B51ABAB4E497C817F3DB7646D6F9A72",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827627","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:50:20Z","15827641","774A22CD3D09CCA86968B07E1D6DE89C4C190645B541C73F5B56BE0D55B6C18A",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827638","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:51:20Z","15827651","273BEB331C3392D5D3F0CE04A9B8503DCCF95DDC7E519B90CC6C258DA06CD88B",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827648","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:52:20Z","15827661","8DBAB0C611FA344EAEE61B5270A0284EF5C6B4865C0288C8DCEDAF9A0BAF9818",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827658","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:53:17Z","15827670","A97A93C446D05E85E1277EF5451329F2F229C6694AA9836BA7C012DCF9600FE2",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827668","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:54:17Z","15827680","3BDF8918413223D81E5802FAA7307657276AB9DDB60BE792CD2FFF7B8DF97F64",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827678","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:55:33Z","15827692","054DA97FD371249CD17069D197992469A6F8EB39DB14400EC901E5A11912276B",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827689","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:56:32Z","15827702","EFAA5C8E885E211D0109202217E8AC3A9D05D3381406208F0C8CC65FC573E8F4",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827699","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:57:32Z","15827712","17E2C69CF1BAE820FF25C00CF7CBD0640A5A7A202872A93371D3554B55C90099",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827709","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:58:31Z","15827722","56498D973432ECC50E19029BA23269AB77CE44A2E172BD3597D3753F1B2F5A7C",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827720","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:59:32Z","15827732","BA0CAB4E094D1BFC24F4B5BBA7A21EA5EEA42912F6BFD862D8DB83253497B17D",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827730","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T08:00:52Z","15827745","66E8BD7FDF6EFFF86FC247EE1BB1503891C94E3EE53426C1D94377AF64916AC9",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827742","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T08:01:50Z","15827754","177463C0403E1C21BA5E5B9AC88D21264BA338509DFC8AB462DAD424D8DFC40E",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827751","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T08:02:49Z","15827764","14FCD0FF5C688C1E66F7D25229E6DE10AD9F8429DC32524235CE3D07251649DC",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827762","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T08:03:19Z","15827769","7356195DD0C8E5F5AC77552C3B654A0012EDF12B08FC438701CF5BB26DE6C045",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827766","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T08:03:38Z","15827772","CE57B2AB2612A2136B1BAFCDC6DC618634D6D8BE7E503CAFBC8BE3197AD46A44",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827770","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
  • bid-closed
$ provider-services query txs --events "akash.v1.provider=akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs&akash.v1.module=market&akash.v1.action=bid-closed" --page 1 --limit 100 -o json | jq -r '.txs[] | [.timestamp, .height, .txhash, .code, (.tx.body.messages[] | ."@type"), (.logs[].events[].attributes[] | (select(.key == "action") | .value), (select(.key == "dseq") | .value), (select(.key == "provider") | .value), (select(.key == "price-amount") | .value))] | @csv'
...
"2024-03-25T13:27:13Z","15585782","40182C4E571FA08B191477AB133E0476FD4340896AE83BF4EAD73CF17079E436",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15585729","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.023612000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-03-29T13:39:57Z","15643714","23F0F6A6F91E9CA5B7DF9D5D00016BA9CF99EC70007995B9381F2363BA6629A1",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15643660","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.076710000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:50:08Z","15827639","65A7CD516C85A4AD78AEED5A172A334C4211B1657BDABE461A6D79B3CF1DBB3E",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827587","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:51:14Z","15827650","7F87BB1729137662B4BB83F4D433874D83214321157EDBAB96B8966F387656BB",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827597","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:52:14Z","15827660","021A2C39A5325FB64CC82DE6FFB72388DC73779F6B4BA92215E4AF9689E74C2D",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827607","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:53:23Z","15827671","F84104F7B3D92762AA7AD7BD934B1718FBBCFFB98330BE25B488E60C19593A28",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827617","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:54:11Z","15827679","2B5B6C5E421DE0391805B8A6F2E02CAFF63B180B4E28A66EFC68EB4B83CA8D72",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827627","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:55:27Z","15827691","3DC611BF65C1950AAC75FA6F5C91FA7F82000450F1B9D5ADF539D001081E0D6F",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827638","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:56:26Z","15827701","90A4C06DFE401B45D6E24EB30687A25297A559A718065E700074BEC266A3022C",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827648","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:57:26Z","15827711","097BD4458BE4D85A92304AAC44E48AEB1A1D40C951F2457A55122D425DAE5DC0",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827658","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:58:19Z","15827720","B921C408906F0DDCD0093C588AEBAAC3B36AA47193561E8B091FC9945F28CB10",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827668","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:59:20Z","15827730","50935281BEE2C55275096D7CE46DA21484335B06C72EC6CF34C629F5C6E5649B",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827678","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T08:00:38Z","15827743","9C1314BBD1CA12C18750D87A68098F2B6E86B8D8AAEFFF20946FCD970057F882",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827689","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T08:01:33Z","15827752","744924EA176AE8F769D46D036720175188CE769FA4DCBEBBBF815FD78E02C266",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827699","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T08:04:45Z","15827783","A8BA8BA520E485F64AA00116811EC463BEBAF5585C8664A694E87F01AF52AAE9",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827730","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"

@andy108369
Contributor Author

andy108369 commented Apr 12, 2024

Provider 0.5.12 no longer exhibits the excessive resource reporting issue 🚀

Next steps:

  • update the charts to ship provider 0.5.12 (a minimal upgrade sketch follows this list)
  • announce the update to this version across all providers
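
For reference, rolling the fix out via the Helm charts could look like the sketch below. This is only a sketch: the release names (akash-provider, inventory-operator), the chart names, and the image.tag value follow a common community setup and are assumptions that may differ per cluster.

$ helm repo update akash
# bump the provider image to the fixed build (values are illustrative, not a confirmed chart interface)
$ helm upgrade akash-provider akash/provider -n akash-services \
    --reuse-values --set image.tag=0.5.12
# upgrade the inventory operator as well so it runs the fixed build
$ helm upgrade inventory-operator akash/akash-inventory-operator -n akash-services \
    --reuse-values --set image.tag=0.5.12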
