Question about Loss is infinite or NaN #3

Open · Lilzhuzixi opened this issue May 22, 2024 · 6 comments

Comments

@Lilzhuzixi

Dear author, I am an entry-level novice and I have a question I would like to ask you about. I modified the .sh file to run on oxford_flowers, but after I started the run, the following error was reported. I am looking forward to your reply.
Traceback (most recent call last):
File "train.py", line 238, in
main(args)
File "train.py", line 165, in main
trainer.train()
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 393, in train
super().train(self.start_epoch, self.max_epoch)
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 256, in train
self.run_epoch()
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 603, in run_epoch
loss_summary = self.forward_backward(batch)
File "D:\PycharmProjects\Textual-based_Class-aware_prompt_tuning-main\trainers\tcp.py", line 328, in forward_backward
self.model_backward_and_update(loss)
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 308, in model_backward_and_update
self.model_backward(loss)
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 297, in model_backward
self.detect_anomaly(loss)
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 229, in detect_anomaly
raise FloatingPointError("Loss is infinite or NaN!")
FloatingPointError: Loss is infinite or NaN!
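
For reference, the check that raises this error (dassl/engine/trainer.py, per the traceback above) amounts to something like the following. This is a minimal sketch reconstructed from the error message, not the verbatim Dassl source:

import torch

def detect_anomaly(loss: torch.Tensor) -> None:
    # Raise as soon as the loss tensor contains a non-finite value (+/-inf or NaN),
    # which is what aborts training here.
    if not torch.isfinite(loss).all():
        raise FloatingPointError("Loss is infinite or NaN!")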

This is base2new_train_flowers.sh.

#!/bin/bash
# custom config
DATA=DATA
TRAINER=TCP
WEIGHT=1.0

CFG=vit_b16_ep100_ctxv1
CTP=end  # class token position (end or middle)
NCTX=4  # number of context tokens
SHOTS=16  # number of shots (1, 2, 4, 8, 16)
CSC=False  # class-specific context (False or True)
FOLDER=output_flowers

for SEED in 1 2 3
do
    DIR=${FOLDER}_${NCTX}/base2new/train_base/oxford_flowers/shots_${SHOTS}_${WEIGHT}/${TRAINER}/${CFG}/seed${SEED}
    if [ -d "$DIR" ]; then
        echo "Results are available in ${DIR}. Skip this job"
    else
        echo "Run this job and save the output to ${DIR}"
        set CUDA_VISIBLE_DEVICES=0
        python train.py \
        --root ${DATA} \
        --seed ${SEED} \
        --trainer ${TRAINER} \
        --dataset-config-file configs/datasets/oxford_flowers.yaml \
        --config-file configs/trainers/${TRAINER}/${CFG}.yaml \
        --output-dir ${DIR} \
        TRAINER.COOP.N_CTX ${NCTX} \
        TRAINER.COOP.CSC ${CSC} \
        TRAINER.COOP.W ${WEIGHT} \
        TRAINER.COOP.CLASS_TOKEN_POSITION ${CTP} \
        DATASET.NUM_SHOTS ${SHOTS} \
        DATASET.SUBSAMPLE_CLASSES base
    fi
done
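
One aside on the script itself: set CUDA_VISIBLE_DEVICES=0 is cmd.exe syntax and does not set the environment variable under bash; the usual bash form is export CUDA_VISIBLE_DEVICES=0, or prefixing the python command with CUDA_VISIBLE_DEVICES=0. This is presumably unrelated to the NaN, though.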
@htyao89
Owner

htyao89 commented May 22, 2024


Did you adjust the EPS in the optimizer?

[image]
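
In case it helps other readers: Adam divides each update by sqrt(v) + eps, so when the second-moment estimate v underflows (which happens easily with fp16 CLIP features), the default eps of 1e-8 can make the step extremely large. A minimal sketch of raising eps when the optimizer is built; the parameter tensor and learning rate below are toy stand-ins, not TCP's real prompt learner or config:

import torch
import torch.nn as nn

# Toy stand-in for the learnable context vectors; TCP's real module differs.
ctx = nn.Parameter(torch.zeros(4, 512))

# eps defaults to 1e-8; raising it to 1e-3 (as discussed in this thread) keeps
# the Adam denominator sqrt(v) + eps well away from zero when v underflows.
optimizer = torch.optim.Adam([ctx], lr=2e-3, eps=1e-3)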

@Lilzhuzixi
Author

Yes, I have adjusted eps to 1e-3.

@htyao89
Owner

htyao89 commented May 27, 2024

In my experiments, the NaN loss was always caused by the Adam optimizer. Can you provide the log file?
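
For anyone trying to localize where the NaN first appears, rather than only catching it at the loss, PyTorch's autograd anomaly mode reports the forward op responsible; a small sketch, to be enabled only while debugging since it slows training noticeably:

import torch

# Call this once before trainer.train(); a backward pass that produces NaN/inf
# gradients will then raise with a traceback of the forward op responsible.
torch.autograd.set_detect_anomaly(True)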

@jianai13579

I ran into the same problem even though I modified the eps. The log file has not been generated.
Traceback (most recent call last):
File "D:\project\deeplearning\Textual-based_Class-aware_prompt_tuning-main\train.py", line 238, in
main(args)
File "D:\project\deeplearning\Textual-based_Class-aware_prompt_tuning-main\train.py", line 165, in main
trainer.train()
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 386, in train
super().train(self.start_epoch, self.max_epoch)
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 250, in train
self.run_epoch()
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 597, in run_epoch
loss_summary = self.forward_backward(batch)
File "D:\project\deeplearning\Textual-based_Class-aware_prompt_tuning-main\trainers\tcp.py", line 328, in forward_backward
self.model_backward_and_update(loss)
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 302, in model_backward_and_update
self.model_backward(loss)
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 291, in model_backward
self.detect_anomaly(loss)
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 223, in detect_anomaly
raise FloatingPointError("Loss is infinite or NaN!")
FloatingPointError: Loss is infinite or NaN!

@jianai13579

I found that line 81 and the block after it were not being executed, so I added eps=1e-3 below line 88, and now it runs through.
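
A quick way to confirm that an edit like this actually took effect is to print the eps the built optimizer ended up with. A hypothetical helper, assuming access to whatever Adam instance the trainer constructs:

import torch

def print_eps(optimizer: torch.optim.Optimizer) -> None:
    # Print the eps each parameter group is actually using, e.g. right after
    # the trainer builds its optimizer.
    for i, group in enumerate(optimizer.param_groups):
        print(f"param group {i}: eps = {group.get('eps')}")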

@Lilzhuzixi
Author

Lilzhuzixi commented Aug 2, 2024 via email
