Execution keeps breaking even when running on A100 GPU #25

hannahhb · 2024-09-15T18:19:18Z

Using the starter kit, the LLAMA baseline still isn't working for me. It keeps breaking with an out-of-memory (OOM) error during the evaluation stage, even on an A100 GPU. Is there anything that can be done to stop this from happening?

hannahhb · 2024-09-15T18:24:00Z

I've tried setting os.environ['PYTORCH_CUDA_ALLOC_CONF'] to 'max_split_size_mb:64', to 'max_split_size_mb:128', and to 'expandable_segments:True'.
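For reference, here is a minimal sketch of setting the variable. It assumes the CUDA caching allocator reads PYTORCH_CUDA_ALLOC_CONF when CUDA memory is first initialized, so the safest place to set it is at the very top of the entry script, before torch is imported. Options use a comma-separated "key:value" format. The helper names build_alloc_conf and set_alloc_conf are hypothetical and not part of the starter kit or PyTorch.

```python
import os
import sys
import warnings


def build_alloc_conf(options: dict) -> str:
    """Join allocator options into the "key:value,key:value" format (hypothetical helper)."""
    return ",".join(f"{key}:{value}" for key, value in options.items())


def set_alloc_conf(options: dict) -> str:
    """Set PYTORCH_CUDA_ALLOC_CONF, warning if torch is already loaded (hypothetical helper)."""
    if "torch" in sys.modules:
        # Assumption: if CUDA has already been initialized in this process,
        # changing the variable now may have no effect.
        warnings.warn(
            "torch is already imported; PYTORCH_CUDA_ALLOC_CONF may be ignored "
            "if CUDA has already been initialized"
        )
    conf = build_alloc_conf(options)
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = conf
    return conf


# Call this before `import torch` in the script that runs the evaluation.
set_alloc_conf({"expandable_segments": "True"})
```

If the variable is set only after the model has already been loaded onto the GPU, it may not change the allocator's behavior.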

hannahhb · 2024-09-16T08:32:04Z

Update: I tried running the script on an A6000 rented from Cudo Compute, but it still breaks :( Any suggestions?
