@kzhang28 As I mentioned above, you can check, for instance, the job with "jobid": "application_1506638472019_12703". It uses 8 GPUs allocated on the machine "ip": "**m51**". However, m51 is equipped with only 2 GPUs.
Sorry, I should have made my question clearer. I meant to ask whether you know the reason for this mismatch.
Recently, I analyzed the trace data and found that "cluster_machine_list" does not match "cluster_job_log".
For instance, the job log shown below submits an 8-GPU job to machine "m51". However, "m51" has only 2 GPUs.
Furthermore, I analyzed "cluster_gpu_log" and found that the GPU counts differ completely from those in "cluster_machine_list":
However,
I am really confused by the trace. Could you please explain these mismatches?
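
For anyone who wants to reproduce the check, here is a minimal sketch of the comparison between "cluster_machine_list" and "cluster_job_log". It assumes both files have already been parsed into simple Python structures; the `Placement` record and the `find_overcommitted` helper are hypothetical names chosen for illustration, not part of the trace or its tooling, and the real file layout may differ. The sample values are the ones from the job quoted in this thread.

```python
from typing import Dict, Iterable, List, NamedTuple


class Placement(NamedTuple):
    """One job-to-machine assignment (hypothetical, simplified record)."""
    jobid: str
    ip: str         # machine name as it appears in the job log, e.g. "m51"
    gpus_used: int  # number of GPUs the job log reports on that machine


def find_overcommitted(
    machine_gpus: Dict[str, int], placements: Iterable[Placement]
) -> List[Placement]:
    """Return placements that use more GPUs than the machine list reports."""
    mismatches = []
    for p in placements:
        capacity = machine_gpus.get(p.ip)
        # Machines missing from the machine list are skipped, not flagged.
        if capacity is not None and p.gpus_used > capacity:
            mismatches.append(p)
    return mismatches


# Values taken from the example in this thread.
machine_gpus = {"m51": 2}
placements = [Placement("application_1506638472019_12703", "m51", 8)]

for p in find_overcommitted(machine_gpus, placements):
    print(f"{p.jobid}: {p.gpus_used} GPUs on {p.ip}, "
          f"machine list reports {machine_gpus[p.ip]}")
```

Running the same helper over every placement in the job log would show whether the mismatch is limited to a few machines or affects the whole trace.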