Question about Loss is infinite or NaN #3

Open · Lilzhuzixi opened this issue May 22, 2024 · 6 comments

Comments

@Lilzhuzixi

Dear author, I am an entry-level novice and I have a question I would like to ask you about. I modified the .sh file to run on oxford_flowers, but after I started the run, the following error was reported. I am looking forward to your reply.
Traceback (most recent call last):
File "train.py", line 238, in
main(args)
File "train.py", line 165, in main
trainer.train()
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 393, in train
super().train(self.start_epoch, self.max_epoch)
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 256, in train
self.run_epoch()
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 603, in run_epoch
loss_summary = self.forward_backward(batch)
File "D:\PycharmProjects\Textual-based_Class-aware_prompt_tuning-main\trainers\tcp.py", line 328, in forward_backward
self.model_backward_and_update(loss)
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 308, in model_backward_and_update
self.model_backward(loss)
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 297, in model_backward
self.detect_anomaly(loss)
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 229, in detect_anomaly
raise FloatingPointError("Loss is infinite or NaN!")
FloatingPointError: Loss is infinite or NaN!
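
For reference, the check that raises this error (dassl/engine/trainer.py, per the traceback above) amounts to something like the following. This is a minimal sketch reconstructed from the error message, not the verbatim Dassl source:

import torch

def detect_anomaly(loss: torch.Tensor) -> None:
    # Raise as soon as the loss tensor contains a non-finite value (+/-inf or NaN),
    # which is what aborts training here.
    if not torch.isfinite(loss).all():
        raise FloatingPointError("Loss is infinite or NaN!")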

This is base2new_train_flowers.sh.

#!/bin/bash
# custom config
DATA=DATA
TRAINER=TCP
WEIGHT=1.0

CFG=vit_b16_ep100_ctxv1
CTP=end  # class token position (end or middle)
NCTX=4  # number of context tokens
SHOTS=16  # number of shots (1, 2, 4, 8, 16)
CSC=False  # class-specific context (False or True)
FOLDER=output_flowers

for SEED in 1 2 3
do
    DIR=${FOLDER}_${NCTX}/base2new/train_base/oxford_flowers/shots_${SHOTS}_${WEIGHT}/${TRAINER}/${CFG}/seed${SEED}
    if [ -d "$DIR" ]; then
        echo "Results are available in ${DIR}. Skip this job"
    else
        echo "Run this job and save the output to ${DIR}"
        set CUDA_VISIBLE_DEVICES=0
        python train.py \
        --root ${DATA} \
        --seed ${SEED} \
        --trainer ${TRAINER} \
        --dataset-config-file configs/datasets/oxford_flowers.yaml \
        --config-file configs/trainers/${TRAINER}/${CFG}.yaml \
        --output-dir ${DIR} \
        TRAINER.COOP.N_CTX ${NCTX} \
        TRAINER.COOP.CSC ${CSC} \
        TRAINER.COOP.W ${WEIGHT} \
        TRAINER.COOP.CLASS_TOKEN_POSITION ${CTP} \
        DATASET.NUM_SHOTS ${SHOTS} \
        DATASET.SUBSAMPLE_CLASSES base
    fi
done
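
One aside on the script itself: set CUDA_VISIBLE_DEVICES=0 is cmd.exe syntax and does not set the environment variable under bash; the usual bash form is export CUDA_VISIBLE_DEVICES=0, or prefixing the python command with CUDA_VISIBLE_DEVICES=0. This is presumably unrelated to the NaN, though.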
@htyao89
Owner

htyao89 commented May 22, 2024


Did you adjust the EPS in the optimizer?

[image]
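
In case it helps other readers: Adam divides each update by sqrt(v) + eps, so when the second-moment estimate v underflows (which happens easily with fp16 CLIP features), the default eps of 1e-8 can make the step extremely large. A minimal sketch of raising eps when the optimizer is built; the parameter tensor and learning rate below are toy stand-ins, not TCP's real prompt learner or config:

import torch
import torch.nn as nn

# Toy stand-in for the learnable context vectors; TCP's real module differs.
ctx = nn.Parameter(torch.zeros(4, 512))

# eps defaults to 1e-8; raising it to 1e-3 (as discussed in this thread) keeps
# the Adam denominator sqrt(v) + eps well away from zero when v underflows.
optimizer = torch.optim.Adam([ctx], lr=2e-3, eps=1e-3)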

@Lilzhuzixi
Author

Yes, I have adjusted eps to 1e-3.

@htyao89
Owner

htyao89 commented May 27, 2024

In my experiments, the NaN loss was always caused by the Adam optimizer. Can you provide the log file?
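
For anyone trying to localize where the NaN first appears, rather than only catching it at the loss, PyTorch's autograd anomaly mode reports the forward op responsible; a small sketch, to be enabled only while debugging since it slows training noticeably:

import torch

# Call this once before trainer.train(); a backward pass that produces NaN/inf
# gradients will then raise with a traceback of the forward op responsible.
torch.autograd.set_detect_anomaly(True)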

@jianai13579

I ran into the same problem even though I modified the eps. The log file has not been generated.
Traceback (most recent call last):
File "D:\project\deeplearning\Textual-based_Class-aware_prompt_tuning-main\train.py", line 238, in
main(args)
File "D:\project\deeplearning\Textual-based_Class-aware_prompt_tuning-main\train.py", line 165, in main
trainer.train()
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 386, in train
super().train(self.start_epoch, self.max_epoch)
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 250, in train
self.run_epoch()
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 597, in run_epoch
loss_summary = self.forward_backward(batch)
File "D:\project\deeplearning\Textual-based_Class-aware_prompt_tuning-main\trainers\tcp.py", line 328, in forward_backward
self.model_backward_and_update(loss)
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 302, in model_backward_and_update
self.model_backward(loss)
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 291, in model_backward
self.detect_anomaly(loss)
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 223, in detect_anomaly
raise FloatingPointError("Loss is infinite or NaN!")
FloatingPointError: Loss is infinite or NaN!

@jianai13579

I found that line 81 and the block after it were not being executed, so I added eps=1e-3 below line 88, and now it runs through.
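
A quick way to confirm that an edit like this actually took effect is to print the eps the built optimizer ended up with. A hypothetical helper, assuming access to whatever Adam instance the trainer constructs:

import torch

def print_eps(optimizer: torch.optim.Optimizer) -> None:
    # Print the eps each parameter group is actually using, e.g. right after
    # the trainer builds its optimizer.
    for i, group in enumerate(optimizer.param_groups):
        print(f"param group {i}: eps = {group.get('eps')}")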

@Lilzhuzixi
Author

Lilzhuzixi commented Aug 2, 2024 via email
