Provider reports excessively high amount of Allocatable CPU & RAM when inventory operator hits an ERROR #192
Comments
sg.lnlm - provider after 16 hours of uptime
Logs
Recovered after an operator-inventory restart:
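For reference, a minimal sketch of how the operator can be bounced, assuming the standard akash-services namespace and the default operator-inventory deployment / akash-provider-0 pod names:

```bash
# Restart the inventory operator (assumed to run as the
# "operator-inventory" deployment in the "akash-services" namespace).
kubectl -n akash-services rollout restart deployment/operator-inventory
kubectl -n akash-services rollout status deployment/operator-inventory

# If the provider itself also needs a bounce (statefulset pod akash-provider-0):
kubectl -n akash-services delete pod akash-provider-0
```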
I'm also having some weird issues. When this happens I cannot bid for GPUs on a different node. Fixing it requires bouncing the operator-inventory.
Narrowing the issue down, based on the provider's uptime (~5 days), it appears that only providers that have or had
A couple of additional observations:
Now the Hurricane provider keeps reporting
New issue after restarting a worker node (node5) which had akash-provider-0 and operator-inventory running on it. I have 11 GPUs total; 7 are shown as active, but the inventory says all 11 GPUs are used and 0 GPUs are pending. Fixed by bouncing akash-provider-0 and operator-inventory, after which the inventory started to show correct values again.
The inventory completely stopped working.
After restarting both services (akash-provider-0, operator-inventory):
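A rough way to cross-check what Kubernetes itself reports for GPU capacity versus what the inventory claims (node5 is taken from the comment above; the nvidia.com/gpu resource name is an assumption based on the NVIDIA device plugin):

```bash
# Allocatable nvidia.com/gpu per node, straight from the kubelet.
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU_ALLOCATABLE:.status.allocatable.nvidia\.com/gpu'

# GPUs actually requested by pods scheduled on a given node (node5 as an example).
kubectl describe node node5 | grep -A8 'Allocated resources'
```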
Fixed the Hurricane reporting. Possibly it was caused by some of the deployments in
It appears to still be an issue with provider-services v0.5.9 & v0.5.11:
(pdx.nb.akash.pub 4090s provider) node1 and node2 had the wrong nvidia.ko driver version installed (550 instead of 535). This caused some pods to get stuck in
Which in turn triggered this bug (see the GPU count for node1 and node2):
I've deleted those that were stuck in
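A sketch of the kind of checks involved here: verifying which NVIDIA driver is actually loaded and cleaning up pods that are stuck (the exact stuck state isn't captured above; the pod and namespace names are placeholders):

```bash
# Loaded NVIDIA driver version on a GPU node (run on the node itself,
# or via nvidia-smi inside the device-plugin pod).
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Pods that are not Running/Succeeded anywhere in the cluster.
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Delete a stuck pod once identified (placeholder names).
kubectl -n <namespace> delete pod <stuck-pod>
```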
pdx.nb provider - issue happened in under 55 minutes after an operator-inventory restart. I think the pdx.nb provider is a good candidate to start monitoring the issue on, since it appears this issue started to occur more often after the node1.pdx.nb.akash.pub node was replaced yesterday (the mainboard, GPUs & ceph disk), except for the main OS disks (rootfs).
This might also be the trigger:
sg.lnlm.akash.pub - issue got triggered. Looks like this triggered the "excessively large stats" issue on sg.lnlm.akash.pub:
Complete logs with the timestamps: Additionally, there haven't been
however, there have been some
The provider
Next steps:
akash network 0.30.0
provider 0.5.4
Observation
The nvdp/nvidia-device-plugin helm-chart was installed by mistake and then removed after a short time:
I reinstalled the operator-inventory; it helped at first glance. However, after some time I noticed the issue appeared again:
Additionally, I've noticed this error in the operator-inventory, but soon figured that it doesn't seem to be the cause, compared to the other providers which have seen the same error in their inventory operator:
Provider logs
sg.lnlm.provider.log
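One way logs like the attached file might be captured, assuming the standard akash-services namespace and the default workload names:

```bash
# Capture recent logs from both components.
kubectl -n akash-services logs deployment/operator-inventory --since=24h > operator-inventory.log
kubectl -n akash-services logs akash-provider-0 --since=24h > sg.lnlm.provider.log
```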
Detailed info (8443/status)
sg.lnlm.provider-info-detailed.log
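The detailed info above comes from the provider's 8443/status endpoint; one way to snapshot it (hostname taken from this thread, jq used only for readability):

```bash
# Snapshot the provider's reported inventory from the public status endpoint.
# -k because the endpoint serves the provider's own certificate.
curl -sk https://sg.lnlm.akash.pub:8443/status | jq .
# The allocatable/allocated figures live in the cluster inventory section of
# that JSON (the exact layout varies between provider versions).
```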
Additional observations
With operator-inventory running for over 16 minutes, the issue hasn't appeared yet. I'll keep monitoring it.
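A minimal monitoring sketch along those lines, assuming the same 8443/status endpoint; the hostname and polling interval are placeholders:

```bash
#!/usr/bin/env bash
# Poll the provider status endpoint every 5 minutes and keep timestamped
# snapshots, so the moment the inflated allocatable values appear can be
# correlated with the operator-inventory logs.
HOST="sg.lnlm.akash.pub"   # placeholder provider hostname
while true; do
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  curl -sk "https://${HOST}:8443/status" > "status-${ts}.json"
  sleep 300
done
```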