Support accelerator optimized VMs in Google Cloud #5372

Closed
siddharthab opened this issue Oct 5, 2024 · 6 comments · Fixed by #5406

Comments

@siddharthab
Contributor

New feature

Usage scenario

Google Cloud Batch supports accelerator-optimized VMs, which need to be configured a little differently. They do not need the accelerator type and count to be specified, but they still need installGpuDrivers set to true. See Google documentation.

Suggest implementation

For this line, we also need to check whether the task's machine type is g2-*, a2-* or a3-*, even if the accelerator type and count are not set. I think this should be enough. Happy to send a PR.
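
The proposed check can be sketched in a few lines. This is only an illustration of the logic described above, not the Nextflow implementation (which is written in Groovy); the function names and the `accelerator_count` parameter are hypothetical.

```python
# Machine-type families named in this issue as accelerator optimized.
ACCELERATOR_OPTIMIZED_PREFIXES = ("g2-", "a2-", "a3-")


def is_accelerator_optimized(machine_type):
    """Return True if the machine type belongs to the g2, a2 or a3 family."""
    return machine_type is not None and machine_type.startswith(
        ACCELERATOR_OPTIMIZED_PREFIXES
    )


def should_install_gpu_drivers(machine_type=None, accelerator_count=0):
    """Hypothetical helper: decide whether installGpuDrivers should be true.

    Drivers are requested when an accelerator is set explicitly, or when
    the machine type is accelerator optimized even without an accelerator.
    """
    return accelerator_count > 0 or is_accelerator_optimized(machine_type)
```

With this sketch, `should_install_gpu_drivers("g2-standard-96")` is true even though no accelerator was requested, while a plain `n1-standard-4` with no accelerator is not.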

@bentsherman
Member

PRs are welcome, even if you can only get part of the way

siddharthab pushed a commit to siddharthab/nextflow that referenced this issue Oct 17, 2024
Also clean up some old logic for container options when using GPUs.
These are now automatically handled by Google Cloud.

Fixes nextflow-io#5372.

Signed-off-by: Siddhartha Bagaria <[email protected]>
@zihhuafang

zihhuafang commented Dec 11, 2024

Hi,
I am wondering whether Nextflow supports custom vCPU and memory settings for g2-* machine types, or for GPU machine types in general.
I attempted to set up a custom g2 machine with the following configuration:

    accelerator 4, type: 'nvidia-l4'
    cpus 48
    memory '216 GB'
    machineType 'g2-*'
    disk 1500.GB, type: 'local-ssd'

Got the following error from Nextflow:

Caused by:
  INVALID_ARGUMENT: Accelerator field is invalid. Machine type g2-standard-96 does not support accelerator with type nvidia-l4 and GPU count 4. Please make sure that the configuration meets this requirement: https://cloud.google.com/compute/docs/gpus#l4-gpus.

I am able to manually create an instance with the custom settings, as shown in the screenshot below:
[Image: Screenshot 2024-12-11 at 13 02 54]

I also tried to leave out machineType and got the following error:

Caused by:
  INVALID_ARGUMENT: Accelerator field is invalid. Accelerator with type nvidia-l4 should use g2 machine types. Please make sure that the configuration meets this requirement: https://cloud.google.com/compute/docs/gpus#l4-gpus.

Does Nextflow currently not support custom GCP VMs?
I would like to increase the memory while keeping the same number of GPUs.

@siddharthab
Contributor Author

Hi @zihhuafang, what you are trying to do can be achieved by setting the machineType to g2-custom-48-221184. See Google documentation. You will need this PR, though, so that Google Batch is asked to install the GPU drivers on your machine.
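
The name g2-custom-48-221184 encodes 48 vCPUs and 216 GB of memory expressed in MB (216 × 1024 = 221184). A minimal sketch of that arithmetic follows; the helper name and its parameters are hypothetical, and it only builds the string without checking whether Google Cloud accepts the combination.

```python
def custom_machine_type(family, cpus, memory_gb, extended=False):
    """Build a custom machine type name such as 'g2-custom-48-221184'.

    family    -- machine family prefix, e.g. 'g2'; None or '' for N1,
                 whose custom types carry no prefix
    cpus      -- number of vCPUs
    memory_gb -- memory in GB; the name carries it in MB (GB * 1024)
    extended  -- append the '-ext' suffix used for extended memory
    """
    memory_mb = memory_gb * 1024
    prefix = f"{family}-" if family else ""
    name = f"{prefix}custom-{cpus}-{memory_mb}"
    return name + "-ext" if extended else name


print(custom_machine_type("g2", 48, 216))
```

The resulting string is what would go in the machineType directive in place of 'g2-*'.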

@zihhuafang

Hi @siddharthab,
Thanks for the pointer!
I thought a custom machine was requested by specifying cpus and memory.
Will wait for the PR then!

@zihhuafang

Hi @siddharthab,
Sorry, one more question. Does this PR also work for a custom N1 machine with T4 GPUs?
I want to set up a custom N1 VM with extended memory, so the machine type is custom-32-225280-ext (4 NVIDIA T4). In this case, I couldn't use accelerator 4, type: 'nvidia-l4' as the machine type did not start with N1-*, and I got the following error:

INVALID_ARGUMENT: Accelerator field is invalid. Accelerator with machine type custom-32-225280-ext (4 NVIDIA T4) is not supported. Please make sure that the configuration meets this requirement: https://cloud.google.com/compute/docs/gpus.

So it would also need installGpuDrivers set to true in this case, since I request custom-32-225280-ext (4 NVIDIA T4) without specifying an accelerator.

@siddharthab
Contributor Author

@zihhuafang the machine type cannot include GPU specs the way you have specified them. You can first try to create the machine with the gcloud command line (in the web console, you can click Equivalent Code to get this command):

gcloud compute instances create custom-machine-with-gpu \
    --machine-type=custom-4-43008-ext \
    --accelerator=count=1,type=nvidia-tesla-t4 \
    # Other options

So you have to specify the accelerator field separately. But I think this is not supported by Google Batch. You can try creating a job with these specs directly using the gcloud command like this:

gcloud batch jobs submit job-custom-machine-with-gpu --location us-central1 --config - <<EOD
{"taskGroups":[{"taskCount":"1","parallelism":"1","taskSpec":{"computeResource":{"cpuMilli":"1000","memoryMib":"512"},"runnables":[{"script":{"text":"echo \"Hello\""}}],"volumes":[]}}],"allocationPolicy":{"instances":[{"policy":{"accelerators":[{"count":1,"type":"nvidia-tesla-t4"}],"machineType":"custom-4-43008-ext"}}]}}
EOD

And this will give the error:

ERROR: (gcloud.batch.jobs.submit) INVALID_ARGUMENT: Accelerator field is invalid. Accelerator with machine type custom-4-43008-ext is not supported. Please make sure that the configuration meets this requirement: https://cloud.google.com/compute/docs/gpus.
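
Hand-written one-line JSON like the job configuration above is easy to get subtly wrong, so it may be simpler to generate it. The following is a rough sketch that reuses the field names from the command above; the function name is hypothetical, and writing `accelerators` as a list is an assumption about the Batch job schema.

```python
import json


def batch_job_config(machine_type, accelerator_type=None, accelerator_count=0):
    """Hypothetical helper: build a minimal Batch job config as a dict."""
    policy = {"machineType": machine_type}
    if accelerator_type and accelerator_count > 0:
        # Assumption: accelerators is a list of {type, count} objects.
        policy["accelerators"] = [
            {"type": accelerator_type, "count": accelerator_count}
        ]
    return {
        "taskGroups": [{
            "taskCount": "1",
            "parallelism": "1",
            "taskSpec": {
                "computeResource": {"cpuMilli": "1000", "memoryMib": "512"},
                "runnables": [{"script": {"text": 'echo "Hello"'}}],
                "volumes": [],
            },
        }],
        "allocationPolicy": {"instances": [{"policy": policy}]},
    }


config = batch_job_config("custom-4-43008-ext", "nvidia-tesla-t4", 1)
# json.dumps guarantees syntactically valid JSON for --config.
print(json.dumps(config))
```

The printed string can be piped to `gcloud batch jobs submit ... --config -` in place of the heredoc. It only guarantees valid JSON; Batch may still reject the machine type and accelerator combination as shown in the error above.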

Your best hope is to reach out to Google Cloud reps if you want this feature supported in Batch.
