CUDA out of memory #43
Comments
Do your V100 GPUs have 32 GB of memory or 16 GB? The training part requires 32 GB V100 GPUs. And have you made any changes to the code?
Dear Professor,
Hello!
I am using a server with four V100 GPUs (each with 32 GB of memory). When I run the SimKGC project in this environment without any modifications, I encounter the following error:
Traceback (most recent call last):
File "main.py", line 22, in <module>
main()
File "main.py", line 18, in main
trainer.train_loop()
File "/root/autodl-tmp/SimKGC-main/trainer.py", line 76, in train_loop
self.train_epoch(epoch)
File "/root/autodl-tmp/SimKGC-main/trainer.py", line 151, in train_epoch
outputs = self.model(**batch_dict)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/root/miniconda3/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/autodl-tmp/SimKGC-main/models.py", line 66, in forward
hr_vector = self._encode(self.hr_bert,
File "/root/autodl-tmp/SimKGC-main/models.py", line 47, in _encode
outputs = encoder(input_ids=token_ids,
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 996, in forward
encoder_outputs = self.encoder(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 585, in forward
layer_outputs = layer_module(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 472, in forward
self_attention_outputs = self.attention(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 402, in forward
self_outputs = self.self(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 340, in forward
context_layer = torch.matmul(attention_probs, value_layer)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `result`
Previously, before moving to the 4 V100 environment, I received a different error and modified the code to get past it. After making those changes and switching to the 4 V100 environment, I started encountering the "CUDA out of memory" error.
Could you please provide some guidance on how to address this issue?
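For reference, this is the minimal check I ran to confirm that every card actually reports about 32 GB; it only uses the standard torch.cuda API and is not specific to SimKGC:

import torch

# Print total and currently allocated memory for every visible GPU.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    total_gb = props.total_memory / 1024 ** 3
    allocated_gb = torch.cuda.memory_allocated(i) / 1024 ** 3
    print(f"cuda:{i} ({props.name}): {total_gb:.1f} GB total, {allocated_gb:.2f} GB allocated")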
Best regards!
Hello, I noticed that your readme.md states that a "CUDA out of memory" issue may be caused by limited hardware resources. However, I am using a server with four V100 GPUs, so why am I still facing this problem? Moreover, reducing the batch size did not resolve the issue.
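A minimal way to separate a genuine out-of-memory failure from a cuBLAS or driver problem is to run a single encoder forward pass on one GPU, outside nn.DataParallel. The sketch below is only a rough check: bert-base-uncased is a stand-in for whatever pretrained model the SimKGC config actually loads, and the batch of eight short sentences is an arbitrary choice.

import torch
from transformers import AutoModel, AutoTokenizer

# Single-GPU forward pass through a BERT encoder, bypassing nn.DataParallel.
# If this succeeds, the card and the CUDA/cuBLAS stack are healthy, and the
# failure is more likely a real memory issue during multi-GPU training.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to("cuda:0").eval()

batch = tokenizer(["a small test sentence"] * 8, padding=True, return_tensors="pt").to("cuda:0")
with torch.no_grad():
    outputs = model(**batch)

print(outputs.last_hidden_state.shape)
print(f"{torch.cuda.memory_allocated(0) / 1024 ** 3:.2f} GB allocated on cuda:0")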