“cluster_machine_list” critical issue -- Machine GPU numbers NOT match to job log #8

Open
Qinghao-Hu opened this issue Nov 7, 2020 · 3 comments

Comments

@Qinghao-Hu

Recently, I analyzed the trace data and found that “cluster_machine_list” does not match "cluster_job_log".

For instance, the job log entry below shows an 8-GPU job submitted to machine "m51". However, according to its “cluster_machine_list” entry, "m51" has only 2 GPUs:

m51,2, 12GB

{
    "status": "Pass",
    "vc": "2869ce",
    "jobid": "application_1506638472019_12703",
    "attempts": [
        {
            "start_time": "2017-10-06 14:40:02",
            "end_time": "2017-10-09 05:19:16",
            "detail": [
                {
                    "ip": "m51",
                    "gpus": [
                        "gpu0",
                        "gpu1",
                        "gpu2",
                        "gpu3",
                        "gpu4",
                        "gpu5",
                        "gpu6",
                        "gpu7"
                    ]
                }
            ]
        }
    ]
}
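The manual check above can be automated along these lines. This is a minimal sketch, assuming the machine list consists of rows of the form `machine_id, gpu_count, gpu_memory` (as in the `m51` line above) and that the job log is a JSON array of job objects shaped like the one shown. The function names `load_machine_gpus` and `find_oversubscribed` are hypothetical and not part of any trace tooling.

```python
import csv
import io
import json


def load_machine_gpus(machine_list_text):
    """Map machine id -> GPU count from rows like 'm51,2, 12GB' (assumed layout)."""
    gpus = {}
    for row in csv.reader(io.StringIO(machine_list_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        try:
            gpus[row[0].strip()] = int(row[1].strip())
        except ValueError:
            continue  # skip a header row, if one is present
    return gpus


def find_oversubscribed(jobs, machine_gpus):
    """Yield (jobid, machine, gpus_used, gpus_listed) where a job uses
    more GPUs on a machine than the machine list says it has."""
    for job in jobs:
        for attempt in job.get("attempts", []):
            for detail in attempt.get("detail", []):
                machine = detail["ip"]
                used = len(detail["gpus"])
                listed = machine_gpus.get(machine)
                if listed is not None and used > listed:
                    yield job["jobid"], machine, used, listed


# The two records quoted in this issue, inlined for illustration.
machine_gpus = load_machine_gpus("m51,2, 12GB\n")
jobs = json.loads("""
[{"status": "Pass", "vc": "2869ce",
  "jobid": "application_1506638472019_12703",
  "attempts": [{"start_time": "2017-10-06 14:40:02",
                "end_time": "2017-10-09 05:19:16",
                "detail": [{"ip": "m51",
                            "gpus": ["gpu0", "gpu1", "gpu2", "gpu3",
                                     "gpu4", "gpu5", "gpu6", "gpu7"]}]}]}]
""")
mismatches = list(find_oversubscribed(jobs, machine_gpus))
for jobid, machine, used, listed in mismatches:
    print(f"{jobid}: {used} GPUs used on {machine}, but only {listed} listed")
```

Running the same check over the full job log would show whether "m51" is an isolated case or whether the mismatch is systematic.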

Furthermore, I analyzed "cluster_gpu_log" and found that the per-machine GPU counts are completely different from those in “cluster_machine_list”:

Machine details from “cluster_machine_list”

| Total Machine Numbers | 2 GPU Machine (12GB) Numbers | 8 GPU Machine (24GB) Numbers |
| --- | --- | --- |
| 552 | 321 | 231 |

However,

Machine details derived from “cluster_gpu_log”

| Total Machine Numbers | 8 GPU Machine Numbers | 4 GPU Machine Numbers | 0 GPU Machine Numbers | Others (3 or 2 GPUs) |
| --- | --- | --- | --- | --- |
| 552 | 264 | 271 | 13 | 4 |
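A tally like the two above can be produced and compared with a few lines of Python. This is only a sketch: it assumes the per-machine GPU count has already been extracted from each file into a dictionary, and it leaves out how to do that for "cluster_gpu_log", since that depends on the file layout. The helper names and the machine ids `m1` to `m3` are hypothetical toy data, not values from the trace.

```python
from collections import Counter


def gpu_count_histogram(machine_gpus):
    """Count how many machines have each GPU count."""
    return Counter(machine_gpus.values())


def disagreements(listed, observed):
    """Return {machine: (listed_count, observed_count)} for machines that
    appear in both sources but with different GPU counts."""
    common = sorted(listed.keys() & observed.keys())
    return {m: (listed[m], observed[m]) for m in common if listed[m] != observed[m]}


# Hypothetical toy data standing in for the two real sources.
listed = {"m1": 2, "m2": 8, "m3": 8}    # as from cluster_machine_list
observed = {"m1": 8, "m2": 8, "m3": 4}  # as from cluster_gpu_log

hist = gpu_count_histogram(observed)
diff = disagreements(listed, observed)
print(dict(hist))
print(diff)
```

The per-machine disagreement list is more useful than the totals alone, because it shows which machine ids are affected and in which direction.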

I am really confused by the trace. Could you please explain this discrepancy?

@kzhang28

kzhang28 commented Sep 3, 2021

@Tonyhao96 any clues about the issue you mentioned here?

@Qinghao-Hu
Author

@kzhang28 As I mentioned above, you can check, for instance, the job with "jobid": "application_1506638472019_12703". It uses 8 GPUs allocated on the machine with "ip": "m51". However, m51 is equipped with only 2 GPUs.

@kzhang28

kzhang28 commented Sep 4, 2021

> @kzhang28 As I mentioned above, you can check, for instance, the job with "jobid": "application_1506638472019_12703". It uses 8 GPUs allocated on the machine with "ip": "m51". However, m51 is equipped with only 2 GPUs.

Sorry, I should have made my question clearer. I meant to ask whether you know the reason for this mismatch.
