
Settings in train_text_to_video_sft #94

Open
lijain opened this issue Nov 25, 2024 · 7 comments

@lijain

lijain commented Nov 25, 2024

Hi, when training on a single GPU with train_text_to_video_sft.sh and
ACCELERATE_CONFIG_FILE="accelerate_configs/uncompiled_1.yaml"
training works fine. When I switch uncompiled_1.yaml --> deepspeed.yaml to use DeepSpeed, it throws an error:

[screenshot of the error trace]

How should I change the setup to use DeepSpeed?

@sayakpaul
Collaborator

Can we move to English?

@lijain
Author

lijain commented Nov 26, 2024

Using the script train_text_to_video_sft.sh on a single card works fine. Changing ACCELERATE_CONFIG_FILE="accelerate_configs/uncompiled_1.yaml" to deepspeed.yaml returns an error, as shown in the screenshot above.

@sayakpaul
Collaborator

We know DeepSpeed is supported. Without a full picture of your accelerate config and the error trace, we cannot do much.

#10 (comment)

@lijain
Author

lijain commented Nov 26, 2024

Yeah, I thought it was weird, too. My settings are as follows:

[two screenshots of the settings]

@sayakpaul
Collaborator

Can you paste these settings instead of screenshots?

@lijain
Author

lijain commented Nov 27, 2024

> We know DeepSpeed is supported. Without a full picture of your accelerate config and the error trace, we cannot do much.
>
> #10 (comment)

I also read your settings, and mine are basically similar. I am using an A100 80G, so there should not be a memory shortfall.

deepspeed.yaml:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
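To sanity-check that the DeepSpeed keys are nested under the top-level `deepspeed_config` mapping the way accelerate expects, a minimal sketch (PyYAML is assumed, which accelerate already depends on):

```python
# Sanity-check the hand-edited config: the DeepSpeed options must sit under
# the top-level `deepspeed_config` mapping, not at the root of the file.
import yaml

with open("accelerate_configs/deepspeed.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["distributed_type"] == "DEEPSPEED"
assert cfg["deepspeed_config"]["zero_stage"] == 2
print(cfg["deepspeed_config"])
```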

sh setting:

```sh
export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
export TORCHDYNAMO_VERBOSE=1
export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0

GPU_IDS="0"
LEARNING_RATES="1e-4"
LR_SCHEDULES="cosine_with_restarts"
OPTIMIZERS="adamw"
MAX_TRAIN_STEPS="20000"

ACCELERATE_CONFIG_FILE="accelerate_configs/deepspeed.yaml"
DATA_ROOT="/dataset/gen_gif/Dance-VideoGeneration-Dataset"
CAPTION_COLUMN="captions.txt"
VIDEO_COLUMN="videos.txt"
output_dir="/nProject/cogvideox-factory/log/cogvideox_sft_optimizer_${OPTIMIZERS}_steps_${MAX_TRAIN_STEPS}_lr-schedule_${LR_SCHEDULES}_lr_${LEARNING_RATES}/"

accelerate launch --config_file $ACCELERATE_CONFIG_FILE --gpu_ids $GPU_IDS training/cogvideox_image_to_video_sft2.py \
  --pretrained_model_name_or_path /nProject/zpretrain_ckpt/CogVideoX1.5-5B-I2V \
  --data_root $DATA_ROOT \
  --caption_column $CAPTION_COLUMN \
  --video_column $VIDEO_COLUMN \
  --height_buckets 480 \
  --width_buckets 720 \
  --frame_buckets 53 \
  --dataloader_num_workers 8 \
  --pin_memory \
  --num_validation_videos 1 \
  --validation_epochs 1 \
  --seed 42 \
  --mixed_precision bf16 \
  --output_dir $output_dir \
  --max_num_frames 53 \
  --train_batch_size 1 \
  --max_train_steps $MAX_TRAIN_STEPS \
  --checkpointing_steps 2000 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --learning_rate $LEARNING_RATES \
  --lr_scheduler $LR_SCHEDULES \
  --lr_warmup_steps 800 \
  --lr_num_cycles 1 \
  --enable_slicing \
  --enable_tiling \
  --optimizer $OPTIMIZERS \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 0.001 \
  --max_grad_norm 1.0 \
  --allow_tf32 \
  --nccl_timeout 1800
```
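As a quick isolation step, a minimal sketch to confirm the launched process actually picks up the DeepSpeed plugin rather than falling back to a plain single-GPU run (assumes accelerate and deepspeed are installed; check_ds.py is a hypothetical probe script, not part of the repo):

```python
# check_ds.py -- run with:
#   accelerate launch --config_file accelerate_configs/deepspeed.yaml check_ds.py
from accelerate import Accelerator

accelerator = Accelerator()

# Expect DistributedType.DEEPSPEED here; deepspeed_plugin stays None when the
# DeepSpeed config was not applied at launch.
print(accelerator.state.distributed_type)
print(accelerator.state.deepspeed_plugin)
```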

@sayakpaul
Collaborator

Can you format the commands?
