FCP not performing #163

Open
ThomasBlock opened this issue Nov 11, 2024 · 4 comments

Comments

@ThomasBlock

Hey. My FCP provider is behaving a bit oddly. I updated the system and some components, and swapped GPUs.

I can accept jobs via Lagrange and they deploy, but sometimes they are not accessible by the client (error 503), especially when a GPU is involved.

I compared this with the page https://provider.swanchain.io/cp/0x316a2e62D5001eC3393fc83424EDF8CDb5de3e99 and saw that the GPU info there seems outdated. Maybe that is because my "cpu_name": "" no longer has a value?

(My collateral is also too low. Can you please execute the request I submitted via Google Forms and Discord?)

{
  "node_id": "04d6e29dec1f0fe33ea61e76bbd31f0f06e59ceb8f2b30837bc379e812a08fad1eb3306f624ee4931c3d21c3f057693b5773699303c6e64c39d5d81a9ccd557704",
  "cpAccount_address": "0x316a2e62D5001eC3393fc83424EDF8CDb5de3e99",
  "region": "North Rhine-Westphalia-DE",
  "cluster_info": [
    {
      "machine_id": "6bf3e53ddf306fdb450cd2336d41e844",
      "cpu_name": "",
      "cpu": {
        "total": "64",
        "used": "9",
        "free": "55"
      },
      "vcpu": {
        "total": "64",
        "used": "9",
        "free": "55"
      },
      "memory": {
        "total": "132.00 GiB",
        "used": "0.00 GiB",
        "free": "131.00 GiB"
      },
      "gpu": {
        "driver_version": "",
        "cuda_version": "",
        "attached_gpus": 0,
        "details": []
      },
      "storage": {
        "total": "176.00 GiB",
        "used": "0.00 GiB",
        "free": "176.00 GiB"
      }
    },
    {
      "machine_id": "6bf3e53ddf306fdb450cd2336d41e844",
      "cpu_name": "",
      "cpu": {
        "total": "24",
        "used": "3",
        "free": "21"
      },
      "vcpu": {
        "total": "24",
        "used": "3",
        "free": "21"
      },
      "memory": {
        "total": "17.00 GiB",
        "used": "0.00 GiB",
        "free": "16.00 GiB"
      },
      "gpu": {
        "driver_version": "560.35.03",
        "cuda_version": "12.6",
        "attached_gpus": 1,
        "details": [
          {
            "product_name": "NVIDIA 4070 SUPER",
            "status": "available",
            "fb_memory_usage": {
              "total": "12282 MiB",
              "used": "2 MiB",
              "free": "11904 MiB"
            },
            "original_name": "NVIDIA GeForce RTX 4070 SUPER",
            "index": "0"
          }
        ]
      },
      "storage": {
        "total": "437.00 GiB",
        "used": "0.00 GiB",
        "free": "437.00 GiB"
      }
    },
    {
      "machine_id": "6bf3e53ddf306fdb450cd2336d41e844",
      "cpu_name": "",
      "cpu": {
        "total": "48",
        "used": "4",
        "free": "44"
      },
      "vcpu": {
        "total": "48",
        "used": "4",
        "free": "44"
      },
      "memory": {
        "total": "189.00 GiB",
        "used": "0.00 GiB",
        "free": "189.00 GiB"
      },
      "gpu": {
        "driver_version": "560.35.03",
        "cuda_version": "12.6",
        "attached_gpus": 2,
        "details": [
          {
            "product_name": "NVIDIA A4000",
            "status": "available",
            "fb_memory_usage": {
              "total": "16376 MiB",
              "used": "2 MiB",
              "free": "16004 MiB"
            },
            "original_name": "NVIDIA RTX A4000",
            "index": "0"
          },
          {
            "product_name": "NVIDIA A2000",
            "status": "available",
            "fb_memory_usage": {
              "total": "6138 MiB",
              "used": "2 MiB",
              "free": "5828 MiB"
            },
            "original_name": "NVIDIA RTX A2000",
            "index": "1"
          }
        ]
      },
      "storage": {
        "total": "437.00 GiB",
        "used": "0.00 GiB",
        "free": "437.00 GiB"
      }
    }
  ],
  "node_name": "ThomasBlock.io",
  "runtime": "containerd://1.7.5"
}
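
Two things stand out in the payload above: every entry has an empty `cpu_name`, and all three entries carry the same `machine_id`. A minimal sketch of how one might check a payload of this shape for both is shown below. The function name `audit_cluster_info` is hypothetical and not part of any Swan tooling, and the sample is a trimmed copy of the JSON above.

```python
import json
from collections import Counter


def audit_cluster_info(payload: dict) -> list[str]:
    """Return warnings for a provider payload shaped like the JSON above."""
    warnings = []
    nodes = payload.get("cluster_info", [])

    # Flag entries whose cpu_name is missing or empty.
    for i, node in enumerate(nodes):
        if not node.get("cpu_name"):
            warnings.append(f"entry {i}: cpu_name is empty")

    # Flag machine_id values shared by more than one entry.
    counts = Counter(node.get("machine_id") for node in nodes)
    for machine_id, n in counts.items():
        if n > 1:
            warnings.append(f"machine_id {machine_id} appears {n} times")

    return warnings


if __name__ == "__main__":
    # Trimmed-down sample with the same shape as the payload above.
    sample = json.loads("""
    {"cluster_info": [
      {"machine_id": "6bf3e53ddf306fdb450cd2336d41e844", "cpu_name": ""},
      {"machine_id": "6bf3e53ddf306fdb450cd2336d41e844", "cpu_name": ""}
    ]}
    """)
    for warning in audit_cluster_info(sample):
        print(warning)
```

Whether a shared `machine_id` is actually a problem for the provider page is not confirmed anywhere in this thread; the check only makes the pattern visible.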

The exporter is the newest version and reports the CPU correctly (AMD / Intel):

kubectl describe po -n kube-system resource-exporter-ds |grep "Image:"
    Image:          filswan/resource-exporter:v11.3.0
    Image:          filswan/resource-exporter:v11.3.0
    Image:          filswan/resource-exporter:v11.3.0

node1
The node collect gpu info failed, if the node does not have a GPU, this error can be ignored. failed execute nvidia-smi, error:exec: "nvidia-smi": executable file not found in $PATH
{"gpu":{"driver_version":"","cuda_version":"","attached_gpus":0,"details":null},"machine_id":"484ad4b0-258d-4ea7-90c3-822e7e094de7","cpu_name":"INTEL","cpu":{"total":"64","used":"26","free":"38"},"vcpu":{"total":"64","used":"26","free":"38"},"memory":{"total":"132.34 GiB","used":"12.78 GiB","free":"118.30 GiB"},"storage":{"total":"195.78 GiB","used":"103.47 GiB","free":"92.29 GiB"}}

node2
{"gpu":{"driver_version":"560.35.03","cuda_version":"12.6","attached_gpus":1,"details":[{"original_name":"NVIDIA GeForce RTX 4070 SUPER","product_name":"NVIDIA 4070 SUPER","fb_memory_usage":{"total":"12282 MiB","used":"2 MiB","free":"11904 MiB"},"status":"available","index":"0"}]},"machine_id":"66adc456-4b02-4256-9949-4c8183a5c62f","cpu_name":"AMD","cpu":{"total":"24","used":"0","free":"24"},"vcpu":{"total":"24","used":"0","free":"24"},"memory":{"total":"17.14 GiB","used":"0.78 GiB","free":"16.02 GiB"},"storage":{"total":"486.53 GiB","used":"42.44 GiB","free":"444.08 GiB"}}

node3
{"gpu":{"driver_version":"560.35.03","cuda_version":"12.6","attached_gpus":2,"details":[{"original_name":"NVIDIA RTX A4000","product_name":"NVIDIA A4000","fb_memory_usage":{"total":"16376 MiB","used":"2 MiB","free":"16004 MiB"},"status":"available","index":"0"},{"original_name":"NVIDIA RTX A2000","product_name":"NVIDIA A2000","fb_memory_usage":{"total":"6138 MiB","used":"2 MiB","free":"5828 MiB"},"status":"available","index":"1"}]},"machine_id":"e42155b8-80c1-4935-a786-36c4d67177b1","cpu_name":"INTEL","cpu":{"total":"48","used":"0","free":"48"},"vcpu":{"total":"48","used":"0","free":"48"},"memory":{"total":"190.03 GiB","used":"0.96 GiB","free":"187.57 GiB"},"storage":{"total":"486.53 GiB","used":"151.61 GiB","free":"334.91 GiB"}}
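
To make the mismatch between a node's exporter line and the corresponding entry on the provider page explicit, a small comparison like the following could be used. This is only a sketch: `diff_fields` is a hypothetical helper, and the values are copied from the node2 exporter line and the second `cluster_info` entry above (matched by their 24 CPUs and the RTX 4070 SUPER).

```python
import json


def diff_fields(exporter: dict, reported: dict, fields=("machine_id", "cpu_name")) -> dict:
    """Return {field: (exporter_value, reported_value)} for fields that differ."""
    return {
        f: (exporter.get(f), reported.get(f))
        for f in fields
        if exporter.get(f) != reported.get(f)
    }


if __name__ == "__main__":
    # Subset of the node2 exporter line shown above.
    exporter_line = '{"machine_id":"66adc456-4b02-4256-9949-4c8183a5c62f","cpu_name":"AMD"}'
    # Same fields from the matching entry in the provider payload.
    reported = {"machine_id": "6bf3e53ddf306fdb450cd2336d41e844", "cpu_name": ""}

    for field, (got, shown) in diff_fields(json.loads(exporter_line), reported).items():
        print(f"{field}: exporter={got!r} provider={shown!r}")
```
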
@Normalnoise
Collaborator

Why is the cpu name "" for your CP? Can you run lscpu?

@ThomasBlock
Author

Why is the cpu name "" for your CP? Can you run lscpu?

@Normalnoise That is exactly the question I asked you. In the lower part of my post you can see that each individual node's exporter reports the correct CPU. The lscpu output per node:

node1
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz

node2
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 5900X 12-Core Processor

node3
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz
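
For comparing lscpu output against the short labels the exporter prints ("INTEL", "AMD"), a sketch along these lines may help. The mapping in `VENDOR_LABELS` is an assumption based only on the outputs in this thread; how the exporter really derives `cpu_name` is not shown here, and both function names are hypothetical.

```python
def parse_lscpu(text: str) -> dict:
    """Parse 'Key: value' lines as printed by lscpu into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Assumed mapping from lscpu vendor strings to the labels seen in the exporter output.
VENDOR_LABELS = {"GenuineIntel": "INTEL", "AuthenticAMD": "AMD"}


def vendor_label(text: str) -> str:
    """Return the short vendor label, or an empty string if the vendor is unknown."""
    return VENDOR_LABELS.get(parse_lscpu(text).get("Vendor ID", ""), "")


if __name__ == "__main__":
    node2 = "Vendor ID: AuthenticAMD\nModel name: AMD Ryzen 9 5900X 12-Core Processor"
    print(vendor_label(node2), "/", parse_lscpu(node2)["Model name"])
```
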

@ThomasBlock
Author

Here are the results of further testing. Most jobs work now.

  • CPU jobs are okay (but all land on node1)
  • GPU jobs for the A2000 and A4000 are okay (on node3)
  • node2 does not get jobs and the RTX 4070 SUPER is not listed on Lagrange (maybe because this is a more exotic type; I will probably migrate to another GPU)

The following 400 error still shows up for every job. Can I ignore it?

time="2024-11-13 09:16:16.928" level=info msg="uploading file to bucket, objectName: mcs_cache/e534840b-b936-4bcb-8fa6-a4aae0a62a3f.json, filePath: /root/cp/mcs_cache/e534840b-b936-4bcb-8fa6-a4aae0a62a3f.json" func=UploadFileToBucket file="storage_service.go:50"
time="2024-11-13 09:16:17.244" level=info msg="job_uuid: 32af907d-a427-49e1-b9fa-092cc5789579, spaceName: tetris33, hardwareName: Nvidia A2000 · 4 vCPU · 8 GiB" func=DeploySpaceTask file="space_service.go:962"
time="2024-11-13 09:16:17.375" level=info msg="space service deployed, job_uuid: 32af907d-a427-49e1-b9fa-092cc5789579, spaceName: tetris33" func=watchContainerRunningTime file="deploy.go:707"
time="2024-11-13 09:16:20.330" level=error msg="http status: 400 Bad Request, code:400, url:https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=fde78405-f495-4cb2-88ca-eb9ec61afa98&object_name=mcs_cache/e534840b-b936-4bcb-8fa6-a4aae0a62a3f.json" func=HttpRequest file="restful.go:127"
time="2024-11-13 09:16:20.330" level=error msg="https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=fde78405-f495-4cb2-88ca-eb9ec61afa98&object_name=mcs_cache/e534840b-b936-4bcb-8fa6-a4aae0a62a3f.json failed, status:error, message:invalid param value:record not found" func=HttpRequest file="restful.go:154"
time="2024-11-13 09:16:20.330" level=error msg="https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=fde78405-f495-4cb2-88ca-eb9ec61afa98&object_name=mcs_cache/e534840b-b936-4bcb-8fa6-a4aae0a62a3f.json failed, status:error, message:invalid param value:record not found" func=HttpGet file="restful.go:64"
time="2024-11-13 09:16:27.178" level=info msg="file name:1_e534840b-b936-4bcb-8fa6-a4aae0a62a3f.json, chunk size:752" func=func1 file="file.go:248"
time="2024-11-13 09:16:32.645" level=info msg="successfully uploaded to MCS, jobuuid: 32af907d-a427-49e1-b9fa-092cc5789579" func=1 file="space_service.go:278"
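
To pull just the error records out of a log like the one above, the key=value lines can be parsed with a small regex. This is a rough sketch that assumes every line follows the `key=value` / `key="quoted value"` layout seen here; `parse_line` and `error_lines` are hypothetical helper names, and the sample line uses a shortened placeholder URL.

```python
import re

# key=value pairs where the value is either double-quoted or a bare token.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_line(line: str) -> dict:
    """Split one key=value log line into a dict, stripping surrounding quotes."""
    return {key: value.strip('"') for key, value in PAIR.findall(line)}


def error_lines(lines):
    """Yield parsed records whose level is 'error'."""
    for line in lines:
        record = parse_line(line)
        if record.get("level") == "error":
            yield record


if __name__ == "__main__":
    sample = [
        'time="2024-11-13 09:16:17.375" level=info msg="space service deployed" func=watchContainerRunningTime file="deploy.go:707"',
        'time="2024-11-13 09:16:20.330" level=error msg="http status: 400 Bad Request, code:400, url:https://example.invalid/get?a=1&b=2" func=HttpRequest file="restful.go:127"',
    ]
    for record in error_lines(sample):
        print(record["time"], record["func"], record["msg"])
```

Because a quoted `msg` is consumed as a single value, query parameters inside the URL are not mistaken for separate log fields.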

@Normalnoise
Collaborator

Yes, you can ignore it.
