
CUDA out of memory when training with fp16 #2065

Closed
vigneshmj1997 opened this issue May 21, 2021 · 2 comments

Comments

@vigneshmj1997

```yaml
data:
    corpus_1:
        path_src: /home/test/Train.en
        path_tgt: /home/test/Train.de
    valid:
        path_src: /mnt/hdd/vignesh/final/training/newstest2013.en
        path_tgt: /mnt/hdd/vignesh/final/training/newstest2013.de
save_model: /home/test/OpenNMT-py/fusedadam
share_vocab: True
valid_steps: 5000
save_checkpoint_steps: 10000
train_steps: 1300000
warmup_steps: 8000
report_every: 1000
seed: 2
src_vocab: /home/test/Opennmt_en_de.vocab.src

# model
#train_from: /home/test/OpenNMT-py/checkpoints_step_270000.pt
encoder_type: transformer
decoder_type: transformer
enc_layers: 12
dec_layers: 1
rnn_size: 512
word_vec_size: 512
share_decoder_embeddings: True
share_embeddings: True
heads: 8
transformer_ff: 2048
model_dtype: "fp16"

# loss function
accum_count: 8
optim: "fusedadam"
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0
batch_size: 2048
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1

max_generator_batches: 2

param_init: 0
param_init_glorot: 'true'
position_encoding: 'true'

world_size: 4
gpu_ranks: [0, 1, 2, 3]
#log_file: log/en_de.log
#exp: transformer_student_deep_shallow_en_de
#train_from: checkpoints/combined_wikimatrix_zoho_student/_step_295000.pt
```

This is my config file.
I am getting CUDA out of memory when using model_dtype=fp16, but the code works fine when using model_dtype=fp32.
I reduced the batch size down to 128 and still get the same error.
If it helps:
1) I am using a 2080 Ti GPU.
2) I installed apex from source, not from requirements.txt.
3) My CUDA version is 11.1.
4) Since apex only installs against CUDA 10.2, I commented out the version-check error message (an option apex itself suggests).
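On point 2, one quick way to check whether the apex build actually exposes FusedAdam is a small import probe (a diagnostic sketch, not part of OpenNMT-py; the import path assumes the `apex.optimizers.FusedAdam` variant, which may differ across apex versions):

```python
import importlib.util


def fused_adam_available() -> bool:
    """Return True if apex is installed and exposes FusedAdam.

    The fused optimizer is only present when apex was built with its
    CUDA extensions (the --cuda_ext / --deprecated_fused_adam flags);
    a plain Python-only install will fail this check.
    """
    if importlib.util.find_spec("apex") is None:
        return False
    try:
        from apex.optimizers import FusedAdam  # noqa: F401
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("FusedAdam available:", fused_adam_available())
```

If this prints `False`, the fusedadam optimizer setting would fail or fall back before any fp16 memory question even arises.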

@francoishernandez
Member

Have you tried with the filtertoolong transform? #2040 (comment)
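For reference, that suggestion could look like this in the config (a sketch assuming OpenNMT-py 2.x transform syntax; the length limits are illustrative, not recommended values):

```yaml
data:
    corpus_1:
        path_src: /home/test/Train.en
        path_tgt: /home/test/Train.de
        transforms: [filtertoolong]
# drop examples longer than these token counts (illustrative values)
src_seq_length: 200
tgt_seq_length: 200
```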
Though it's strange that it would fail in fp16 but work in fp32.
Maybe there is an issue with your apex setup. Did you add these flags when building?
```
--global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam"
```
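For completeness, the full source build would be along these lines (assuming the pip-based install described in the apex README; exact flags and options vary by apex version):

```shell
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir \
    --global-option="--cpp_ext" --global-option="--cuda_ext" \
    --global-option="--deprecated_fused_adam" ./
```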

@vigneshmj1997
Author

vigneshmj1997 commented May 28, 2021

Yeah, I used that same command while installing apex. I also downgraded my CUDA version to 10.2 and tried the whole setup again, but I am still getting the same CUDA out of memory error.
