ubi-worker-gpu on intel not working? code 132 #129

Open
ThomasBlock opened this issue Aug 16, 2024 · 14 comments

ThomasBlock commented Aug 16, 2024

I switched from an AMD to an Intel CPU.
Now the ubi worker no longer succeeds (tested with a GPU 512M and a GPU 32G task).

Do you have a hint for me, @Normalnoise?
There is no output, and it just ends with code 132.

Exit code 132 indicates that the container was terminated by a SIGILL signal, which usually means that the container tried to execute an illegal instruction.
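
For reference, the mapping from a container exit status to a signal can be sketched in Python. This is only an illustration and not part of swan-provider or ubi-bench: the helper name `describe_exit_code` is made up here, and it assumes the usual Linux convention that a status above 128 means the process was terminated by signal number (status - 128).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit status into a short description."""
    if code > 128:
        # By convention, 128 + N means "terminated by signal N".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


for code in (0, 132, 137):
    print(code, describe_exit_code(code))
```

On Linux this maps 132 to SIGILL and 137 (the other status reported later in this thread) to SIGKILL.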

CPU is Intel Xeon CPU E5-2696 v2

{"status":"Pull complete","progressDetail":{},"id":"d22a4480910c"}
{"status":"Digest: sha256:5538129f3569c0c7c6f708c9d4160f4e25f0feb0b0cff1d09de526f11d2c19f7"}
{"status":"Status: Downloaded newer image for filswan/ubi-worker-gpu-intel:latest"}
time="2024-08-15 22:41:57.901" level=warning msg="task_id: 776267, starting container, container name: fil-c2-32g-776267m4wfk" func=func1 file="ubi.go:698"
time="2024-08-15 22:42:01.816" level=warning msg="task_id: 776267, started container, container name: fil-c2-32g-776267m4wfk" func=func1 file="ubi.go:711"
docker ps -a
CONTAINER ID   IMAGE                                 COMMAND                 CREATED             STATUS                         PORTS     NAMES
1b10a4168417   filswan/ubi-worker-gpu-intel:latest   "ubi-bench c2"          6 minutes ago       Exited (132) 6 minutes ago               fil-c2-32g-776267m4wfk
c9c9dd54a55a   filswan/resource-exporter:v11.2.8     "./resource-exporter"   10 minutes ago      Up 10 minutes                            resource-exporter
623aca62f1d9   filswan/ubi-worker-gpu-amd:latest     "ubi-bench c2"          36 minutes ago      Exited (0) 36 minutes ago                fil-c2-512m-824031mc6b7
40db0b4d6cea   filswan/ubi-worker-gpu-amd:latest     "ubi-bench c2"          About an hour ago   Exited (0) About an hour ago             fil-c2-512m-824003n4dc5
92fbd2e11bde   filswan/ubi-worker-gpu-amd:latest     "ubi-bench c2"          2 hours ago         Exited (0) 2 hours ago                   fil-c2-512m-823106ae0yp
ThomasBlock changed the title from "ubi-worker-gpu on intel not working?" to "ubi-worker-gpu on intel not working? code 132" on Aug 16, 2024
@Normalnoise
Collaborator

Can you provide the ubi-ecp.log under the CP_PATH?

@ThomasBlock
Author

> Can you provide the ubi-ecp.log under the CP_PATH?

Nothing is logged in ubi-ecp.log, and docker logs 1b10a4168417 is also empty.

When you google Docker error 132, people discuss missing CPU extensions like AVX2, but this processor has it.

@Normalnoise
Collaborator

I have not run into this issue. Can you provide more information, such as the lscpu output, the OS version, and other system details?

@ThomasBlock
Author

> I have not run into this issue. Can you provide more information, such as the lscpu output, the OS version, and other system details?

Sure. Can you tell me how to start filswan/ubi-worker-gpu-intel:latest standalone? Maybe that produces more output.

I run Proxmox hosts with Ubuntu 22 virtual machines; the processor type is host.
(This works for all other AMD ECP and FCP servers.)

The Intel node is a Dell R720 server with two physical processors and 384 GB RAM.

lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   48
  On-line CPU(s) list:    0-47
Vendor ID:                GenuineIntel
  BIOS Vendor ID:         Intel
  Model name:             Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz
    BIOS Model name:            Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz  CPU @ 2.5GHz
    BIOS CPU family:      179
    CPU family:           6
    Model:                62
    Thread(s) per core:   2
    Core(s) per socket:   12
    Socket(s):            2
    Stepping:             4
    CPU(s) scaling MHz:   98%
    CPU max MHz:          3500.0000
    CPU min MHz:          1200.0000
    BogoMIPS:             4999.86
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx
                          fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
                          rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx
                          smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes
                          xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority
                          ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts vnmi flush_l1d
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    768 KiB (24 instances)
  L1i:                    768 KiB (24 instances)
  L2:                     6 MiB (24 instances)
  L3:                     60 MiB (2 instances)
NUMA:                     
  NUMA node(s):           2
  NUMA node0 CPU(s):      0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
  NUMA node1 CPU(s):      1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          KVM: Mitigation: VMX disabled
  L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                    Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Unknown: No mitigations
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected
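
The Flags line in the lscpu output above lists avx but no avx2 entry, which may be relevant to the SIGILL. Which instruction sets ubi-bench actually requires is not stated in this thread, so the `WANTED` list below is purely a hypothetical example; the helper name `missing_flags` is also made up. A minimal Python sketch for checking a flags line against such a list:

```python
def missing_flags(flags_line: str, wanted: list[str]) -> list[str]:
    """Return the wanted CPU flags that do not appear in an lscpu Flags line."""
    present = set(flags_line.split())
    return [flag for flag in wanted if flag not in present]


# Excerpt of the Flags line from the lscpu output above (wrapped words rejoined).
HOST_FLAGS = "ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes xsave avx f16c rdrand"

# Hypothetical requirement list, NOT taken from ubi-bench documentation.
WANTED = ["sse4_2", "avx", "avx2", "bmi2", "adx"]

print(missing_flags(HOST_FLAGS, WANTED))
```

With this excerpt the sketch reports avx2, bmi2 and adx as missing. On a live system the same check can be run against the flags line of /proc/cpuinfo inside the VM.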

@ThomasBlock
Author

@Normalnoise still relevant, can you check?

@Normalnoise
Collaborator

Can you provide your Docker logs from when you receive the ubi task? Maybe run docker logs -f <container id>.

@ThomasBlock
Author

> Can you provide your Docker logs from when you receive the ubi task? Maybe run docker logs -f <container id>.

No, there are zero logs. It just reports exit code 132 / 137, even if I start it manually:

docker run -d \
  --name fil-worker \
  --gpus all \
  -v /path/on/host:/path/in/container \
  filswan/ubi-worker-gpu-intel:latest
bceca15b5f843b6972654c17e720d9193cb988e805b3c6dd5afc174eb7ab84b7

user@swanZK:~$ docker ps -a
CONTAINER ID   IMAGE                                 COMMAND                  CREATED          STATUS                       PORTS     NAMES
bceca15b5f84   filswan/ubi-worker-gpu-intel:latest   "/bin/bash -c 'sleep…"   16 minutes ago   Exited (137) 12 minutes ago             fil-worker
fb4b2b4f8da6   filswan/ubi-worker-gpu-intel:latest   "ubi-bench c2"           6 minutes ago    Exited (132) 6 minutes ago             fil-c2-512m-1253856wd3ah

docker logs bceca15b5f84

docker logs fb4b2b4f8da6

-> both empty results

@Normalnoise
Collaborator

No, you cannot run it standalone. I think I need to test it in a VM next week.

@ThomasBlock
Author

> No, you cannot run it standalone. I think I need to test it in a VM next week.

Thank you. I also changed the CPU type of the VM, for example to
kvm64
qemu64

but swan-provider still wants to load filswan/ubi-worker-gpu-intel:latest (although we don't have Intel now) and fails with error 132.

lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          40 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   48
  On-line CPU(s) list:    0-47
Vendor ID:                GenuineIntel
  Model name:             QEMU Virtual CPU version 2.5+
    CPU family:           15
    Model:                107
    Thread(s) per core:   1
    Core(s) per socket:   48
    Socket(s):            1
    Stepping:             1
    BogoMIPS:             4999.99
    Flags:                fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq
                          pni cx16 x2apic hypervisor lahf_lm cpuid_fault pti
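
Comparing the two lscpu outputs in this thread suggests that the emulated "QEMU Virtual CPU" model exposes fewer instruction-set flags than the host CPU type, so a binary that hits an illegal instruction on host would not be expected to do better there. A rough Python sketch of that comparison, using abbreviated excerpts of the two Flags lines (only SIMD-related and a few other flags are included, and the function name `flags_lost` is made up):

```python
def flags_lost(before: str, after: str) -> list[str]:
    """Return, sorted, the flags present in `before` but absent from `after`."""
    return sorted(set(before.split()) - set(after.split()))


# Abbreviated excerpt of the Flags line with CPU type "host" (E5-2696 v2).
HOST = "sse sse2 pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes xsave avx f16c rdrand lahf_lm"

# Abbreviated excerpt of the Flags line reported as "QEMU Virtual CPU version 2.5+".
QEMU = "sse sse2 pni cx16 x2apic hypervisor lahf_lm"

print(flags_lost(HOST, QEMU))
```

For these excerpts the difference includes ssse3, sse4_1, sse4_2, popcnt, aes and avx, all of which the emulated model no longer advertises.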

@ThomasBlock
Author

@Normalnoise Any updates on this? I would love to run 32G tasks.

sonic-chain commented Nov 19, 2024

You can run the docker run command from your comment above directly, without the -d parameter, and share the output printed to the screen. @ThomasBlock

@ThomasBlock
Author

> docker run -d \
>   --name fil-worker \
>   --gpus all \
>   -v /path/on/host:/path/in/container \
>   filswan/ubi-worker-gpu-intel:latest

Okay @sonic-chain, it's still a blank line; no logs are emitted (because it's only bash sleep, so we need some kind of input data, I guess?).

docker run --name fil-worker --gpus all filswan/ubi-worker-gpu-intel:latest

CONTAINER ID   IMAGE                                 COMMAND                  CREATED         STATUS                      PORTS     NAMES
660b895a1d9a   filswan/ubi-worker-gpu-intel:latest   "/bin/bash -c 'sleep…"   2 minutes ago   Up 2 minutes                          fil-worker

@sonic-chain

docker run -it --gpus all --memory=3g --env RUST_LOG=Debug --env PARAM_URL="https://286cb2c989.acl.swanipfs.com/ipfs/QmTgoX6LkzZTsTjSjXvujzgJEHBLTEg3KMUadQGnyTrNFG" --env RUST_GPU_TOOLS_CUSTOM_GPU="NVIDIA GeForce RTX 3080:8704" -v /var/tmp/filecoin-proof-parameters:/var/tmp/filecoin-proof-parameters filswan/ubi-worker-gpu-intel:latest ubi-bench c2

You need to modify RUST_GPU_TOOLS_CUSTOM_GPU and -v mount parameter path. @ThomasBlock

@ThomasBlock
Author

> docker run -it --gpus all --memory=3g --env RUST_LOG=Debug --env PARAM_URL="https://286cb2c989.acl.swanipfs.com/ipfs/QmTgoX6LkzZTsTjSjXvujzgJEHBLTEg3KMUadQGnyTrNFG" --env RUST_GPU_TOOLS_CUSTOM_GPU="NVIDIA GeForce RTX 3080:8704" -v /var/tmp/filecoin-proof-parameters:/var/tmp/filecoin-proof-parameters filswan/ubi-worker-gpu-intel:latest ubi-bench c2
>
> You need to modify RUST_GPU_TOOLS_CUSTOM_GPU and -v mount parameter path. @ThomasBlock

Thank you, @sonic-chain, but we cannot learn anything new from this: the exit code is 132 and there is no output.
We need to fix error 132.

root@node3:~# docker run -it --gpus all --memory=3g --env RUST_LOG=Debug --env PARAM_URL="https://286cb2c989.acl.swanipfs.com/ipfs/QmTgoX6LkzZTsTjSjXvujzgJEHBLTEg3KMUadQGnyTrNFG" --env RUST_GPU_TOOLS_CUSTOM_GPU="NVIDIA RTX A4000:6144" -v /var/tmp/filecoin-proof-parameters:/var/tmp/filecoin-proof-parameters filswan/ubi-worker-gpu-intel:latest ubi-bench c2
root@node3:~# 
docker ps -a
CONTAINER ID   IMAGE                                 COMMAND                  CREATED         STATUS                       PORTS     NAMES
d76f7e654275   filswan/ubi-worker-gpu-intel:latest   "ubi-bench c2"           8 seconds ago   Exited (132) 6 seconds ago             cranky_chaplygin
docker inspect d76f7e654275
[
    {
        "Id": "d76f7e65427587757507d8d3ec1963d6e0bb2e8f5e4c780b8f4ce40fc8b84b98",
        "Created": "2024-11-22T16:12:50.858636132Z",
        "Path": "ubi-bench",
        "Args": [
            "c2"
        ],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 132,
            "Error": "",
            "StartedAt": "2024-11-22T16:12:51.025675822Z",
            "FinishedAt": "2024-11-22T16:12:51.824079204Z"
        },
        "Image": "sha256:7164d871ee9f6085ce9e1508242849e70c4f1142c7a0036174ba6fa1efde08d4",
        "ResolvConfPath": "/var/lib/docker/containers/d76f7e65427587757507d8d3ec1963d6e0bb2e8f5e4c780b8f4ce40fc8b84b98/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/d76f7e65427587757507d8d3ec1963d6e0bb2e8f5e4c780b8f4ce40fc8b84b98/hostname",
        "HostsPath": "/var/lib/docker/containers/d76f7e65427587757507d8d3ec1963d6e0bb2e8f5e4c780b8f4ce40fc8b84b98/hosts",
        "LogPath": "/var/lib/docker/containers/d76f7e65427587757507d8d3ec1963d6e0bb2e8f5e4c780b8f4ce40fc8b84b98/d76f7e65427587757507d8d3ec1963d6e0bb2e8f5e4c780b8f4ce40fc8b84b98-json.log",
        "Name": "/cranky_chaplygin",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "docker-default",
        "ExecIDs": null,

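The State block in the docker inspect output above already narrows things down a little: OOMKilled is false, and the start and finish timestamps are less than a second apart, which looks more like a crash at startup than a memory problem. A small Python sketch that reads those fields (the JSON excerpt is copied from the output above; the helper name `parse_docker_time` is made up for this example):

```python
import json
from datetime import datetime

# Excerpt of the "State" object from the docker inspect output above.
STATE_JSON = """
{
  "Status": "exited",
  "OOMKilled": false,
  "ExitCode": 132,
  "StartedAt": "2024-11-22T16:12:51.025675822Z",
  "FinishedAt": "2024-11-22T16:12:51.824079204Z"
}
"""


def parse_docker_time(stamp: str) -> datetime:
    """Parse a Docker timestamp, truncating nanoseconds to microseconds."""
    head, _, frac = stamp.rstrip("Z").partition(".")
    frac = (frac + "000000")[:6]
    return datetime.strptime(f"{head}.{frac}", "%Y-%m-%dT%H:%M:%S.%f")


state = json.loads(STATE_JSON)
runtime = (
    parse_docker_time(state["FinishedAt"]) - parse_docker_time(state["StartedAt"])
).total_seconds()

print(f"exit={state['ExitCode']} oom={state['OOMKilled']} runtime={runtime:.3f}s")
```

For this excerpt the container ran for roughly 0.8 seconds before exiting with status 132.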