
CUDA out of memory when training with fp16 #2065

Closed
vigneshmj1997 opened this issue May 21, 2021 · 2 comments

Comments

@vigneshmj1997

```yaml
data:
    corpus_1:
        path_src: /home/test/Train.en
        path_tgt: /home/test/Train.de
    valid:
        path_src: /mnt/hdd/vignesh/final/training/newstest2013.en
        path_tgt: /mnt/hdd/vignesh/final/training/newstest2013.de
save_model: /home/test/OpenNMT-py/fusedadam
share_vocab: True
valid_steps: 5000
save_checkpoint_steps: 10000
train_steps: 1300000
warmup_steps: 8000
report_every: 1000
seed: 2
src_vocab: /home/test/Opennmt_en_de.vocab.src

# model
#train_from: /home/test/OpenNMT-py/checkpoints_step_270000.pt
encoder_type: transformer
decoder_type: transformer
enc_layers: 12
dec_layers: 1
rnn_size: 512
word_vec_size: 512
share_decoder_embeddings: True
share_embeddings: True
heads: 8
transformer_ff: 2048
model_dtype: "fp16"

# loss function
accum_count: 8
optim: "fusedadam"
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0
batch_size: 2048
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1

max_generator_batches: 2

param_init: 0
param_init_glorot: 'true'
position_encoding: 'true'

world_size: 4
gpu_ranks: [0, 1, 2, 3]
#log_file: log/en_de.log
#exp: transformer_student_deep_shallow_en_de
#train_from: checkpoints/combined_wikimatrix_zoho_student/_step_295000.pt
```

This is my config file.
I am getting CUDA out of memory when using model_dtype=fp16, but the code works fine when using model_dtype=fp32.
I reduced the batch size down to 128 and still get the same error.
If it helps:
1) I am using a 2080 Ti GPU.
2) I installed apex from source, not from requirements.txt.
3) My CUDA version is 11.1.
4) Since apex only installs against CUDA 10.2, I commented out the version-check error message (an option apex itself suggests).
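On point 2, one quick way to check whether the apex build actually exposes FusedAdam is a small import probe (a diagnostic sketch, not part of OpenNMT-py; the import path assumes the `apex.optimizers.FusedAdam` variant, which may differ across apex versions):

```python
import importlib.util


def fused_adam_available() -> bool:
    """Return True if apex is installed and exposes FusedAdam.

    The fused optimizer is only present when apex was built with its
    CUDA extensions (the --cuda_ext / --deprecated_fused_adam flags);
    a plain Python-only install will fail this check.
    """
    if importlib.util.find_spec("apex") is None:
        return False
    try:
        from apex.optimizers import FusedAdam  # noqa: F401
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("FusedAdam available:", fused_adam_available())
```

If this prints `False`, the fusedadam optimizer setting would fail or fall back before any fp16 memory question even arises.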

@francoishernandez
Member

Have you tried with the filtertoolong transform? #2040 (comment)
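For reference, that suggestion could look like this in the config (a sketch assuming OpenNMT-py 2.x transform syntax; the length limits are illustrative, not recommended values):

```yaml
data:
    corpus_1:
        path_src: /home/test/Train.en
        path_tgt: /home/test/Train.de
        transforms: [filtertoolong]
# drop examples longer than these token counts (illustrative values)
src_seq_length: 200
tgt_seq_length: 200
```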
Though it's strange that it would fail in fp16 but work in fp32.
Maybe there is an issue with your apex setup. Did you add these flags when building?
```
--global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam"
```
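For completeness, the full source build would be along these lines (assuming the pip-based install described in the apex README; exact flags and options vary by apex version):

```shell
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir \
    --global-option="--cpp_ext" --global-option="--cuda_ext" \
    --global-option="--deprecated_fused_adam" ./
```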

@vigneshmj1997
Author

vigneshmj1997 commented May 28, 2021

Yeah, I used that same command while installing apex. I also downgraded my CUDA version to 10.2 and tried the whole setup again, but I am still getting the same CUDA out of memory error.
