
Settings in train_text_to_video_sft #94

Open
lijain opened this issue Nov 25, 2024 · 7 comments

@lijain

lijain commented Nov 25, 2024

Hi, when training on a single GPU with train_text_to_video_sft.sh and
ACCELERATE_CONFIG_FILE="accelerate_configs/uncompiled_1.yaml"
training works fine. When I switch uncompiled_1.yaml --> deepspeed.yaml to use DeepSpeed, it throws an error:

[screenshot of the error trace]

How should I change the setup to use DeepSpeed?

@sayakpaul
Collaborator

Can we move to English?

@lijain
Author

lijain commented Nov 26, 2024

Using the script train_text_to_video_sft.sh on a single card works fine. Changing ACCELERATE_CONFIG_FILE="accelerate_configs/uncompiled_1.yaml" to deepspeed.yaml returns an error, as shown in the screenshot above.

@sayakpaul
Collaborator

We know DeepSpeed is supported. Without a full picture of your accelerate config and the error trace, we cannot do much.

#10 (comment)

@lijain
Author

lijain commented Nov 26, 2024

Yeah, I thought it was weird, too. My settings are as follows:

[two screenshots of the settings]

@sayakpaul
Collaborator

Can you paste these settings instead of screenshots?

@lijain
Author

lijain commented Nov 27, 2024

> We know DeepSpeed is supported. Without a full picture of your accelerate config and the error trace, we cannot do much.
>
> #10 (comment)

I also read your settings, and mine are basically similar. I am using an A100 80G, so there should not be a memory shortfall.

deepspeed.yaml:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
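To sanity-check that the DeepSpeed keys are nested under the top-level `deepspeed_config` mapping the way accelerate expects, a minimal sketch (PyYAML is assumed, which accelerate already depends on):

```python
# Sanity-check the hand-edited config: the DeepSpeed options must sit under
# the top-level `deepspeed_config` mapping, not at the root of the file.
import yaml

with open("accelerate_configs/deepspeed.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["distributed_type"] == "DEEPSPEED"
assert cfg["deepspeed_config"]["zero_stage"] == 2
print(cfg["deepspeed_config"])
```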

sh setting:

```sh
export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
export TORCHDYNAMO_VERBOSE=1
export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0

GPU_IDS="0"
LEARNING_RATES="1e-4"
LR_SCHEDULES="cosine_with_restarts"
OPTIMIZERS="adamw"
MAX_TRAIN_STEPS="20000"

ACCELERATE_CONFIG_FILE="accelerate_configs/deepspeed.yaml"
DATA_ROOT="/dataset/gen_gif/Dance-VideoGeneration-Dataset"
CAPTION_COLUMN="captions.txt"
VIDEO_COLUMN="videos.txt"
output_dir="/nProject/cogvideox-factory/log/cogvideox_sft_optimizer_${OPTIMIZERS}_steps_${MAX_TRAIN_STEPS}_lr-schedule_${LR_SCHEDULES}_lr_${LEARNING_RATES}/"

accelerate launch --config_file $ACCELERATE_CONFIG_FILE --gpu_ids $GPU_IDS training/cogvideox_image_to_video_sft2.py \
  --pretrained_model_name_or_path /nProject/zpretrain_ckpt/CogVideoX1.5-5B-I2V \
  --data_root $DATA_ROOT \
  --caption_column $CAPTION_COLUMN \
  --video_column $VIDEO_COLUMN \
  --height_buckets 480 \
  --width_buckets 720 \
  --frame_buckets 53 \
  --dataloader_num_workers 8 \
  --pin_memory \
  --num_validation_videos 1 \
  --validation_epochs 1 \
  --seed 42 \
  --mixed_precision bf16 \
  --output_dir $output_dir \
  --max_num_frames 53 \
  --train_batch_size 1 \
  --max_train_steps $MAX_TRAIN_STEPS \
  --checkpointing_steps 2000 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --learning_rate $LEARNING_RATES \
  --lr_scheduler $LR_SCHEDULES \
  --lr_warmup_steps 800 \
  --lr_num_cycles 1 \
  --enable_slicing \
  --enable_tiling \
  --optimizer $OPTIMIZERS \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 0.001 \
  --max_grad_norm 1.0 \
  --allow_tf32 \
  --nccl_timeout 1800
```
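As a quick isolation step, a minimal sketch to confirm the launched process actually picks up the DeepSpeed plugin rather than falling back to a plain single-GPU run (assumes accelerate and deepspeed are installed; check_ds.py is a hypothetical probe script, not part of the repo):

```python
# check_ds.py -- run with:
#   accelerate launch --config_file accelerate_configs/deepspeed.yaml check_ds.py
from accelerate import Accelerator

accelerator = Accelerator()

# Expect DistributedType.DEEPSPEED here; deepspeed_plugin stays None when the
# DeepSpeed config was not applied at launch.
print(accelerator.state.distributed_type)
print(accelerator.state.deepspeed_plugin)
```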

@sayakpaul
Collaborator

Can you format the commands?
