level=warning msg="ubi task id: 142182, type: GPU, not found a resources availabl #68

Open
luhong123 opened this issue May 11, 2024 · 10 comments

@luhong123

luhong123 commented May 11, 2024

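Here is my description of the problem. When a UBI task arrives, the computing provider rejects it with "not found a resources available":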

[GIN] 2024/05/11 - 15:36:03 | 200 |   34.248274ms |   38.104.153.43 | GET      "/api/v1/computing/cp"
time="2024-05-11 15:36:03.554" level=info msg="receive ubi task received: {ID:142182 Name:1000-0-7-130773 Type:1 ZkType:fil-c2-512M InputParam:https://286cb2c989.acl.swanipfs.com/ipfs/QmZRrAkkNcMbV8vyiRz5RRyYdk3CiqJNne3GvuxfKZv7me Signature:0xaaa3eaa064922a2fc2e1aecd4181d50f158f8d11830cea0a69716c7391411dec43e352fba49647ae7c645d41286ce00f3f28468db293f9d5e8230d03eddb239e01 Resource:0xc001094100}" func=DoUbiTaskForK8s file="ubi.go:43"
time="2024-05-11 15:36:03.588" level=info msg="checkResourceAvailableForUbi: needCpu: 1, needMemory: 5.00, needStorage: 1.00" func=checkResourceAvailableForUbi file="cp_service.go:996"
time="2024-05-11 15:36:03.589" level=info msg="checkResourceAvailableForUbi: remainingCpu: 89, remainingMemory: 1007.00, remainingStorage: 131.00" func=checkResourceAvailableForUbi file="cp_service.go:997"
time="2024-05-11 15:36:03.589" level=info msg="gpuName: NVIDIA-2080, nodeGpu: map[:0 kubernetes.io/os:0], nodeGpuSummary: map[xx-xx:map[NVIDIA-2080-TI:1]]" func=checkResourceAvailableForUbi file="cp_service.go:1008"
time="2024-05-11 15:36:03.589" level=warning msg="ubi task id: 142182, type: GPU, not found a resources available" func=DoUbiTaskForK8s file="ubi.go:124"
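The log lines above already show where the check fails: CPU (1 needed, 89 remaining), memory (5.00 vs 1007.00) and storage (1.00 vs 131.00) all fit, so the rejection must come from the GPU lookup, where gpuName NVIDIA-2080 is not a key of nodeGpuSummary (which only holds NVIDIA-2080-TI). The Go sketch below reproduces that reasoning with the numbers from the log. It is not the code in cp_service.go: the type and function names (resources, fitsCPUMemStorage, findGpuNode) are hypothetical, and it assumes the GPU lookup is an exact map-key match.

```go
package main

import "fmt"

// resources holds the quantities printed by checkResourceAvailableForUbi.
// Hypothetical type for illustration; not the project's actual struct.
type resources struct {
	cpu     int
	memory  float64 // units as printed in the log
	storage float64 // units as printed in the log
}

// fitsCPUMemStorage reports whether the remaining capacity covers the need.
func fitsCPUMemStorage(need, remaining resources) bool {
	return remaining.cpu >= need.cpu &&
		remaining.memory >= need.memory &&
		remaining.storage >= need.storage
}

// findGpuNode returns the first node whose GPU summary has at least one card
// under exactly the requested name (assumed exact map-key match).
func findGpuNode(gpuName string, nodeGpuSummary map[string]map[string]int) (string, bool) {
	for node, gpus := range nodeGpuSummary {
		if gpus[gpuName] > 0 {
			return node, true
		}
	}
	return "", false
}

func main() {
	need := resources{cpu: 1, memory: 5.00, storage: 1.00}
	remaining := resources{cpu: 89, memory: 1007.00, storage: 131.00}
	// Node name "xx-xx" is the redacted name from the log.
	summary := map[string]map[string]int{"xx-xx": {"NVIDIA-2080-TI": 1}}

	fmt.Println("cpu/mem/storage ok:", fitsCPUMemStorage(need, remaining))
	_, ok := findGpuNode("NVIDIA-2080", summary)
	fmt.Println("gpu found:", ok)
}
```

With these inputs the first check passes and the GPU lookup misses, which matches the warning on the last log line.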

The GPU is present on the node:

rxxx:~/k8s/fcp/go-computing-provider# nvidia-smi
Sat May 11 16:29:42 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   31C    P8    26W / 260W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

xxx:~/k8s/fcp/go-computing-provider# kubectl  logs -f   nvidia-device-plugin-daemonset-wwv6p    -n   kube-system
I0511 07:48:36.445070       1 main.go:178] Starting FS watcher.
I0511 07:48:36.445190       1 main.go:185] Starting OS watcher.
I0511 07:48:36.445570       1 main.go:200] Starting Plugins.
I0511 07:48:36.445631       1 main.go:257] Loading configuration.
I0511 07:48:36.446613       1 main.go:265] Updating config with default resource matching patterns.
I0511 07:48:36.447078       1 main.go:276]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0511 07:48:36.447097       1 main.go:279] Retrieving plugins.
I0511 07:48:36.447960       1 factory.go:104] Detected NVML platform: found NVML library
I0511 07:48:36.448050       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0511 07:48:38.061302       1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I0511 07:48:38.062559       1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0511 07:48:38.069493       1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

xxx:~/k8s/fcp/go-computing-provider# kubectl  logs -f    resource-exporter-ds-m4c7b     -n   kube-system
{"gpu":{"driver_version":"520.61.05","cuda_version":"11080","attached_gpus":1,"details":[{"product_name":"NVIDIA 2080 TI","fb_memory_usage":{"total":"11264 MiB","used":"244 MiB","free":"11019 MiB"},"bar1_memory_usage":{"total":"256 MiB","used":"2 MiB","free":"253 MiB"}}]},"cpu_name":"AMD"}

Everything on the GPU side looks normal.
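The resource-exporter line can also be checked programmatically. The Go sketch below decodes the JSON shown above with encoding/json and prints the number of attached GPUs, the product name, and the CUDA version. The struct mirrors only the fields visible in that output and is an assumed shape, not the exporter's published schema. The "11080" value is decoded using CUDA's integer version encoding (1000 × major + 10 × minor), which agrees with the "CUDA Version: 11.8" reported by nvidia-smi.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// exporterLine mirrors only the fields visible in the resource-exporter
// output; it is an assumed shape, not an official schema.
type exporterLine struct {
	GPU struct {
		DriverVersion string `json:"driver_version"`
		CudaVersion   string `json:"cuda_version"`
		AttachedGPUs  int    `json:"attached_gpus"`
		Details       []struct {
			ProductName string `json:"product_name"`
		} `json:"details"`
	} `json:"gpu"`
	CPUName string `json:"cpu_name"`
}

// cudaVersionString converts the encoding 1000*major + 10*minor
// (e.g. "11080") into "major.minor" (e.g. "11.8").
func cudaVersionString(raw string) (string, error) {
	v, err := strconv.Atoi(raw)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%d.%d", v/1000, (v%1000)/10), nil
}

// parseExporterLine decodes one JSON line; unknown fields are ignored.
func parseExporterLine(line string) (exporterLine, error) {
	var out exporterLine
	err := json.Unmarshal([]byte(line), &out)
	return out, err
}

func main() {
	line := `{"gpu":{"driver_version":"520.61.05","cuda_version":"11080","attached_gpus":1,"details":[{"product_name":"NVIDIA 2080 TI","fb_memory_usage":{"total":"11264 MiB","used":"244 MiB","free":"11019 MiB"},"bar1_memory_usage":{"total":"256 MiB","used":"2 MiB","free":"253 MiB"}}]},"cpu_name":"AMD"}`
	rec, err := parseExporterLine(line)
	if err != nil {
		panic(err)
	}
	cuda, _ := cudaVersionString(rec.GPU.CudaVersion)
	fmt.Println(rec.GPU.AttachedGPUs, rec.GPU.Details[0].ProductName, cuda) // 1 NVIDIA 2080 TI 11.8
}
```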

@Normalnoise
Collaborator

What is your GPU model: NVIDIA 2080 or NVIDIA 2080 Ti?

@luhong123
Author

What is your GPU model: NVIDIA 2080 or NVIDIA 2080 Ti?

NVIDIA GeForce RTX 2080 Ti

@luhong123
Author

My current setting is:

RUST_GPU_TOOLS_CUSTOM_GPU="GeForce RTX GeForce RTX 2080:4352"


@sonic-chain

Your RUST_GPU_TOOLS_CUSTOM_GPU configuration is incorrect; it should specify the 2080 Ti. Refer to https://github.com/filecoin-project/bellperson?tab=readme-ov-file#supported--tested-cards for the supported and tested cards.
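RUST_GPU_TOOLS_CUSTOM_GPU takes entries of the form `<device name>:<CUDA core count>`, and, assuming the rust-gpu-tools convention, several entries can be separated by commas. The Go sketch below splits such a value into name/core pairs. Both the original and the corrected value parse fine syntactically; what the parse makes visible is the name itself, which in the original has a doubled "GeForce RTX" prefix and no "Ti". The helper name parseCustomGPU is hypothetical: the real parsing happens inside rust-gpu-tools, not in go-computing-provider.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// customGPU is one "<name>:<cores>" entry.
type customGPU struct {
	Name  string
	Cores int
}

// parseCustomGPU splits a RUST_GPU_TOOLS_CUSTOM_GPU-style value, assuming
// comma-separated "<name>:<cores>" entries with surrounding spaces ignored.
func parseCustomGPU(value string) ([]customGPU, error) {
	var out []customGPU
	for _, entry := range strings.Split(value, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		// The core count follows the last ':' in the entry.
		idx := strings.LastIndex(entry, ":")
		if idx < 0 {
			return nil, fmt.Errorf("entry %q has no ':<cores>' suffix", entry)
		}
		cores, err := strconv.Atoi(strings.TrimSpace(entry[idx+1:]))
		if err != nil {
			return nil, fmt.Errorf("entry %q: bad core count: %w", entry, err)
		}
		out = append(out, customGPU{Name: strings.TrimSpace(entry[:idx]), Cores: cores})
	}
	return out, nil
}

func main() {
	for _, v := range []string{
		"GeForce RTX GeForce RTX 2080:4352", // value from this thread
		"GeForce RTX 2080 Ti:4352",          // corrected value
	} {
		gpus, err := parseCustomGPU(v)
		fmt.Printf("%q -> %+v, err=%v\n", v, gpus, err)
	}
}
```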

@luhong123
Author

luhong123 commented May 15, 2024

Your RUST_GPU_TOOLS_CUSTOM_GPU configuration is incorrect; it should specify the 2080 Ti. Refer to https://github.com/filecoin-project/bellperson?tab=readme-ov-file#supported--tested-cards for the supported and tested cards.

I now have four nodes in the cluster. How should I set it for the 3090 and 3080 cards?

@luhong123
Author

I changed the setting to the following, but tasks are still rejected:

RUST_GPU_TOOLS_CUSTOM_GPU="GeForce RTX 2080 Ti:4352"
time="2024-05-15 11:05:25.208" level=warning msg="ubi task id: 156576, type: GPU, not found a resources available" func=DoUbiTaskForK8s file="ubi.go:124"

@sonic-chain

Your RUST_GPU_TOOLS_CUSTOM_GPU configuration is incorrect; it should specify the 2080 Ti. Refer to https://github.com/filecoin-project/bellperson?tab=readme-ov-file#supported--tested-cards for the supported and tested cards.

I now have four nodes in the cluster. How should I set it for the 3090 and 3080 cards?

Refer to https://github.com/filecoin-project/bellperson?tab=readme-ov-file#supported--tested-cards when setting RUST_GPU_TOOLS_CUSTOM_GPU for those cards.
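For a cluster with several card models, one value can list several `<name>:<cores>` entries, assuming the comma-separated form used by rust-gpu-tools. The Go sketch below builds such a value from a name-to-core-count table. The core counts are NVIDIA's published CUDA core counts (RTX 2080 Ti: 4352, RTX 3080 10 GB: 8704, RTX 3090: 10496); the exact name strings are an assumption and must match what the driver reports for each card. The helper name buildCustomGPU is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildCustomGPU joins "<name>:<cores>" entries with commas, sorted by name
// so the output is deterministic regardless of map iteration order.
func buildCustomGPU(cores map[string]int) string {
	names := make([]string, 0, len(cores))
	for name := range cores {
		names = append(names, name)
	}
	sort.Strings(names)
	entries := make([]string, 0, len(names))
	for _, name := range names {
		entries = append(entries, fmt.Sprintf("%s:%d", name, cores[name]))
	}
	return strings.Join(entries, ",")
}

func main() {
	value := buildCustomGPU(map[string]int{
		"GeForce RTX 2080 Ti": 4352,
		"GeForce RTX 3080":    8704,
		"GeForce RTX 3090":    10496,
	})
	// Prints: RUST_GPU_TOOLS_CUSTOM_GPU="GeForce RTX 2080 Ti:4352,GeForce RTX 3080:8704,GeForce RTX 3090:10496"
	fmt.Printf("RUST_GPU_TOOLS_CUSTOM_GPU=%q\n", value)
}
```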

@sonic-chain

RUST_GPU_TOOLS_CUSTOM_GPU="GeForce RTX 2080 Ti:4352" time="2024-05-15 11:05:25.208" level=warning msg="ubi task id: 156576, type: GPU, not found a resources available" func=DoUbiTaskForK8s file="ubi.go:124"

Please provide more of the logs that follow after the task is received.

@luhong123
Author

RUST_GPU_TOOLS_CUSTOM_GPU="GeForce RTX 2080 Ti:4352" time="2024-05-15 11:05:25.208" level=warning msg="ubi task id: 156576, type: GPU, not found a resources available" func=DoUbiTaskForK8s file="ubi.go:124"

Please provide more of the logs that follow after the task is received.

[GIN] 2024/05/15 - 13:35:02 | 200 |   34.013604ms |   38.104.153.43 | GET      "/api/v1/computing/cp"
time="2024-05-15 13:35:03.237" level=info msg="receive ubi task received: {ID:157165 Name:1000-0-7-145779 Type:1 ZkType:fil-c2-512M InputParam:https://286cb2c989.acl.swanipfs.com/ipfs/QmZ7aJtQw42KAXwz8EDEyMafSACzoCVCPbii6o3vhhKjqJ Signature:0x359da3e74c4e4dd1184c80402a531f822732d22ea5d706b4952c415e7fdc0af47e7ed154124d5d9c6931c08367fb4f4ba8a43e0a149b69a2ddc1f85263bda15201 Resource:0xc000358b80}" func=DoUbiTaskForK8s file="ubi.go:43"
time="2024-05-15 13:35:03.238" level=info msg="ubi task sign verifing, task_id: 157165, type: fil-c2-512M, verify: true" func=DoUbiTaskForK8s file="ubi.go:83"
time="2024-05-15 13:35:03.273" level=info msg="checkResourceAvailableForUbi: needCpu: 1, needMemory: 5.00, needStorage: 1.00" func=checkResourceAvailableForUbi file="cp_service.go:996"
time="2024-05-15 13:35:03.274" level=info msg="checkResourceAvailableForUbi: remainingCpu: 89, remainingMemory: 1007.00, remainingStorage: 131.00" func=checkResourceAvailableForUbi file="cp_service.go:997"
time="2024-05-15 13:35:03.274" level=info msg="gpuName: NVIDIA-2080-Ti, nodeGpu: map[:0 kubernetes.io/os:0], nodeGpuSummary: map[XXXXmap[NVIDIA-2080-TI:1]]" func=checkResourceAvailableForUbi file="cp_service.go:1008"
time="2024-05-15 13:35:03.274" level=warning msg="ubi task id: 157165, type: GPU, not found a resources available" func=DoUbiTaskForK8s file="ubi.go:124"
[GIN] 2024/05/15 - 13:35:03 | 500 |   37.388376ms |   38.104.153.43 | POST     "/api/v1/computing/cp/ubi"
2024/05/15 13:35:56 http: TLS handshake error from 38.104.153.43:41776: remote error: tls: bad certificate
[GIN] 2024/05/15 - 13:35:56 | 200 |   34.195284ms |   38.104.153.43 | GET      "/api/v1/computing/cp"
2024/05/15 13:36:10 http: TLS handshake error from 80.66.83.49:41830: tls: first record does not look like a TLS handshake
2024/05/15 13:36:11 http: TLS handshake error from 80.66.83.49:41516: tls: first record does not look like a TLS handshake


@sonic-chain

sonic-chain commented May 15, 2024

Your RUST_GPU_TOOLS_CUSTOM_GPU configuration is incorrect; it should specify the 2080 Ti. Refer to https://github.com/filecoin-project/bellperson?tab=readme-ov-file#supported--tested-cards for the supported and tested cards.

I now have four nodes in the cluster. How should I set it for the 3090 and 3080 cards?

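This is most likely caused by a case mismatch in "Ti": the task looks for NVIDIA-2080-Ti, while the node reports NVIDIA-2080-TI. Since your cluster also has 3080 or 3090 cards, you can configure one of those first so that it receives tasks. I will fix the mismatch.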
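The two names in the 2024-05-15 log differ only in letter case, so an exact map-key lookup misses. One possible way to tolerate that, sketched below in Go, is to compare GPU names with strings.EqualFold. This is an illustration only: findGpuNodeFold and the node name "node-1" are hypothetical (the real node name is redacted in the log), and the maintainers' actual fix may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// findGpuNodeFold scans nodeGpuSummary (node -> GPU name -> count) and
// compares GPU names case-insensitively with strings.EqualFold.
func findGpuNodeFold(gpuName string, nodeGpuSummary map[string]map[string]int) (string, bool) {
	for node, gpus := range nodeGpuSummary {
		for name, count := range gpus {
			if count > 0 && strings.EqualFold(name, gpuName) {
				return node, true
			}
		}
	}
	return "", false
}

func main() {
	summary := map[string]map[string]int{"node-1": {"NVIDIA-2080-TI": 1}}

	// Exact key lookup, as assumed for the current behaviour.
	_, exact := summary["node-1"]["NVIDIA-2080-Ti"]
	// Case-insensitive lookup.
	node, folded := findGpuNodeFold("NVIDIA-2080-Ti", summary)

	fmt.Println("exact match:", exact, "| case-insensitive match:", folded, node)
}
```

Note that case folding alone would not have helped with the first setting, where the task asked for NVIDIA-2080 without "Ti" at all; that case still requires the corrected RUST_GPU_TOOLS_CUSTOM_GPU value.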
