chatglm3-6b: fine-tuning with DeepSpeed fails with "exits with return code = -7", but single-GPU fine-tuning works fine. The method from issue #1683 did not help either #2061
This problem has been already solved
The training command is shown below. Following issue #1683, I also added export NCCL_P2P_LEVEL=NVL, but that did not help either.
module load anaconda cudnn/ compilers/cuda/11.8 compilers/gcc/11.3.0
source activate myLLM
deepspeed --num_gpus 4 --master_port=9901 src/ \
    --deepspeed ds_config.json \
    --stage sft \
    --model_name_or_path ChatGLM3-6B \
    --dataset allSemiDataNewMerge \
    --template chatglm3 \
    --finetuning_type lora \
    --lora_rank 32 \
    --lora_target all \
    --cutoff_len 1024 \
    --output_dir outputs_ChatGLM3_lora_nopt_sft \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 50 \
    --save_total_limit 3 \
    --learning_rate 5e-5 \
    --num_train_epochs 1 \
    --seed 42 \
    --ddp_timeout 30000 \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True
The ds_config.json file is as follows:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
Running tokenizer on dataset: 100%|██████████| 68658/68658 [04:18<00:00, 203.98 examples/s]
Running tokenizer on dataset: 100%|██████████| 68658/68658 [04:19<00:00, 201.93 examples/s]
Running tokenizer on dataset: 100%|██████████| 68658/68658 [04:19<00:00, 264.99 examples/s]
/home/bingxing2/home/scx9042/.conda/envs/myLLM/lib/python3.10/site-packages/transformers/ FutureWarning:
is deprecated and will be removed in version 5 of 🤗 Transformers. Use--hub_token
Running tokenizer on dataset: 100%|██████████| 68658/68658 [04:19<00:00, 264.85 examples/s]
/home/bingxing2/home/scx9042/.conda/envs/myLLM/lib/python3.10/site-packages/transformers/ FutureWarning:
is deprecated and will be removed in version 5 of 🤗 Transformers. Use--hub_token
[WARNING|] 2024-01-03 14:05:27,621 >> You are using an old version of the checkpointing format that is deprecated (We will also silently ignore
in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method_set_gradient_checkpointing
in your model.[WARNING|] 2024-01-03 14:05:27,870 >> You are using an old version of the checkpointing format that is deprecated (We will also silently ignore
in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method_set_gradient_checkpointing
in your model.[2024-01-03 14:05:33,233] [INFO] [] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-01-03 14:05:33,239] [INFO] [] [Rank 0] Using client Optimizer as basic optimizer
[2024-01-03 14:05:33,239] [INFO] [] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-01-03 14:05:33,263] [INFO] [] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-01-03 14:05:33,263] [INFO] [] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-01-03 14:05:33,264] [INFO] [] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-01-03 14:05:33,264] [INFO] [] Reduce bucket size 500000000
[2024-01-03 14:05:33,264] [INFO] [] Allgather bucket size 500000000
[2024-01-03 14:05:33,264] [INFO] [] CPU Offload: False
[2024-01-03 14:05:33,264] [INFO] [] Round robin gradient partitioning: False
[2024-01-03 14:05:35,314] [INFO] [] Before initializing optimizer states
[2024-01-03 14:05:35,315] [INFO] [] MA 11.82 GB Max_MA 11.85 GB CA 11.87 GB Max_CA 12 GB
[2024-01-03 14:05:35,315] [INFO] [] CPU Virtual Memory: used = 76.4 GB, percent = 30.1%
[2024-01-03 14:05:35,519] [INFO] [] After initializing optimizer states
[2024-01-03 14:05:35,520] [INFO] [] MA 11.93 GB Max_MA 12.1 GB CA 12.15 GB Max_CA 12 GB
[2024-01-03 14:05:35,521] [INFO] [] CPU Virtual Memory: used = 76.52 GB, percent = 30.1%
[2024-01-03 14:05:35,521] [INFO] [] optimizer state initialized
[2024-01-03 14:05:35,706] [INFO] [] After initializing ZeRO optimizer
[2024-01-03 14:05:35,707] [INFO] [] MA 11.93 GB Max_MA 11.93 GB CA 12.15 GB Max_CA 12 GB
[2024-01-03 14:05:35,707] [INFO] [] CPU Virtual Memory: used = 76.51 GB, percent = 30.1%
[2024-01-03 14:05:35,710] [INFO] [] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2024-01-03 14:05:35,711] [INFO] [] [Rank 0] DeepSpeed using client LR scheduler
[2024-01-03 14:05:35,711] [INFO] [] [Rank 0] DeepSpeed LR Scheduler = None
[2024-01-03 14:05:35,711] [INFO] [] [Rank 0] step=0, skipped=0, lr=[5e-05], mom=[(0.9, 0.999)]
[2024-01-03 14:05:35,714] [INFO] [] DeepSpeedEngine configuration:
[2024-01-03 14:05:35,715] [INFO] [] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
[2024-01-03 14:05:35,715] [INFO] [] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-01-03 14:05:35,715] [INFO] [] amp_enabled .................. False
[2024-01-03 14:05:35,715] [INFO] [] amp_params ................... False
[2024-01-03 14:05:35,715] [INFO] [] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
[2024-01-03 14:05:35,715] [INFO] [] bfloat16_enabled ............. True
[2024-01-03 14:05:35,716] [INFO] [] checkpoint_parallel_write_pipeline False
[2024-01-03 14:05:35,716] [INFO] [] checkpoint_tag_validation_enabled True
[2024-01-03 14:05:35,716] [INFO] [] checkpoint_tag_validation_fail False
[2024-01-03 14:05:35,716] [INFO] [] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x40047d4088e0>
[2024-01-03 14:05:35,716] [INFO] [] communication_data_type ...... None
[2024-01-03 14:05:35,716] [INFO] [] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-01-03 14:05:35,716] [INFO] [] curriculum_enabled_legacy .... False
[2024-01-03 14:05:35,716] [INFO] [] curriculum_params_legacy ..... False
[2024-01-03 14:05:35,716] [INFO] [] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-01-03 14:05:35,716] [INFO] [] data_efficiency_enabled ...... False
[2024-01-03 14:05:35,716] [INFO] [] dataloader_drop_last ......... False
[2024-01-03 14:05:35,716] [INFO] [] disable_allgather ............ False
[2024-01-03 14:05:35,716] [INFO] [] dump_state ................... False
[2024-01-03 14:05:35,716] [INFO] [] dynamic_loss_scale_args ...... None
[2024-01-03 14:05:35,716] [INFO] [] eigenvalue_enabled ........... False
[2024-01-03 14:05:35,716] [INFO] [] eigenvalue_gas_boundary_resolution 1
[2024-01-03 14:05:35,716] [INFO] [] eigenvalue_layer_name ........ bert.encoder.layer
[2024-01-03 14:05:35,717] [INFO] [] eigenvalue_layer_num ......... 0
[2024-01-03 14:05:35,717] [INFO] [] eigenvalue_max_iter .......... 100
[2024-01-03 14:05:35,717] [INFO] [] eigenvalue_stability ......... 1e-06
[2024-01-03 14:05:35,717] [INFO] [] eigenvalue_tol ............... 0.01
[2024-01-03 14:05:35,717] [INFO] [] eigenvalue_verbose ........... False
[2024-01-03 14:05:35,717] [INFO] [] elasticity_enabled ........... False
[2024-01-03 14:05:35,717] [INFO] [] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
[2024-01-03 14:05:35,717] [INFO] [] fp16_auto_cast ............... None
[2024-01-03 14:05:35,717] [INFO] [] fp16_enabled ................. False
[2024-01-03 14:05:35,717] [INFO] [] fp16_master_weights_and_gradients False
[2024-01-03 14:05:35,717] [INFO] [] global_rank .................. 0
[2024-01-03 14:05:35,717] [INFO] [] grad_accum_dtype ............. None
[2024-01-03 14:05:35,717] [INFO] [] gradient_accumulation_steps .. 1
[2024-01-03 14:05:35,717] [INFO] [] gradient_clipping ............ 1.0
[2024-01-03 14:05:35,717] [INFO] [] gradient_predivide_factor .... 1.0
[2024-01-03 14:05:35,718] [INFO] [] graph_harvesting ............. False
[2024-01-03 14:05:35,718] [INFO] [] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-01-03 14:05:35,718] [INFO] [] initial_dynamic_scale ........ 1
[2024-01-03 14:05:35,718] [INFO] [] load_universal_checkpoint .... False
[2024-01-03 14:05:35,718] [INFO] [] loss_scale ................... 1.0
[2024-01-03 14:05:35,718] [INFO] [] memory_breakdown ............. False
[2024-01-03 14:05:35,718] [INFO] [] mics_hierarchial_params_gather False
[2024-01-03 14:05:35,718] [INFO] [] mics_shard_size .............. -1
[2024-01-03 14:05:35,718] [INFO] [] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-01-03 14:05:35,718] [INFO] [] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
[2024-01-03 14:05:35,718] [INFO] [] optimizer_legacy_fusion ...... False
[2024-01-03 14:05:35,718] [INFO] [] optimizer_name ............... None
[2024-01-03 14:05:35,718] [INFO] [] optimizer_params ............. None
[2024-01-03 14:05:35,718] [INFO] [] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-01-03 14:05:35,718] [INFO] [] pld_enabled .................. False
[2024-01-03 14:05:35,718] [INFO] [] pld_params ................... False
[2024-01-03 14:05:35,718] [INFO] [] prescale_gradients ........... False
[2024-01-03 14:05:35,719] [INFO] [] scheduler_name ............... None
[2024-01-03 14:05:35,719] [INFO] [] scheduler_params ............. None
[2024-01-03 14:05:35,719] [INFO] [] seq_parallel_communication_data_type torch.float32
[2024-01-03 14:05:35,719] [INFO] [] sparse_attention ............. None
[2024-01-03 14:05:35,719] [INFO] [] sparse_gradients_enabled ..... False
[2024-01-03 14:05:35,719] [INFO] [] steps_per_print .............. inf
[2024-01-03 14:05:35,719] [INFO] [] train_batch_size ............. 4
[2024-01-03 14:05:35,719] [INFO] [] train_micro_batch_size_per_gpu 1
[2024-01-03 14:05:35,719] [INFO] [] use_data_before_expert_parallel_ False
[2024-01-03 14:05:35,719] [INFO] [] use_node_local_storage ....... False
[2024-01-03 14:05:35,719] [INFO] [] wall_clock_breakdown ......... False
[2024-01-03 14:05:35,719] [INFO] [] weight_quantization_config ... None
[2024-01-03 14:05:35,719] [INFO] [] world_size ................... 4
[2024-01-03 14:05:35,719] [INFO] [] zero_allow_untested_optimizer True
[2024-01-03 14:05:35,719] [INFO] [] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-01-03 14:05:35,719] [INFO] [] zero_enabled ................. True
[2024-01-03 14:05:35,719] [INFO] [] zero_force_ds_cpu_optimizer .. True
[2024-01-03 14:05:35,719] [INFO] [] zero_optimization_stage ...... 2
[2024-01-03 14:05:35,720] [INFO] [] json = {
"train_batch_size": 4,
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"gradient_clipping": 1.0,
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": false,
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"overlap_comm": false,
"contiguous_gradients": true
"steps_per_print": inf,
"bf16": {
"enabled": true
[INFO|] 2024-01-03 14:05:35,720 >> ***** Running training *****
[INFO|] 2024-01-03 14:05:35,720 >> Num examples = 68,658
[INFO|] 2024-01-03 14:05:35,720 >> Num Epochs = 1
[INFO|] 2024-01-03 14:05:35,720 >> Instantaneous batch size per device = 1
[INFO|] 2024-01-03 14:05:35,720 >> Total train batch size (w. parallel, distributed & accumulation) = 4
[INFO|] 2024-01-03 14:05:35,720 >> Gradient Accumulation steps = 1
[INFO|] 2024-01-03 14:05:35,720 >> Total optimization steps = 17,165
[INFO|] 2024-01-03 14:05:35,724 >> Number of trainable parameters = 59,293,696
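As a sanity check on the summary above: the "auto" batch-size entries in ds_config.json appear to have been resolved consistently with the launch arguments (4 GPUs, per-device batch size 1, 1 accumulation step), so the config itself looks fine. Below is a minimal sketch of that arithmetic, assuming the usual DeepSpeed relation train_batch_size = micro batch size * accumulation steps * number of GPUs, and assuming steps per epoch is simply the example count divided by the global batch, rounded up. The helper names are hypothetical and not part of DeepSpeed or Transformers.

```python
import math


def resolve_auto_batch(per_device_batch: int, grad_accum_steps: int, world_size: int) -> dict:
    """Hypothetical helper: mimic how the "auto" batch entries get resolved.

    Assumes train_batch_size = micro batch * accumulation steps * world size.
    """
    return {
        "train_micro_batch_size_per_gpu": per_device_batch,
        "gradient_accumulation_steps": grad_accum_steps,
        "train_batch_size": per_device_batch * grad_accum_steps * world_size,
    }


def steps_per_epoch(num_examples: int, train_batch_size: int) -> int:
    """Simplified assumption: one optimizer step per global batch, rounded up."""
    return math.ceil(num_examples / train_batch_size)


# Values taken from the command and the log in this issue.
resolved = resolve_auto_batch(per_device_batch=1, grad_accum_steps=1, world_size=4)
print(resolved["train_batch_size"])                          # 4, matches the log
print(steps_per_epoch(68658, resolved["train_batch_size"]))  # 17165, matches the log
```

Both numbers agree with the log (train_batch_size = 4, Total optimization steps = 17,165), which suggests the failure is not caused by a batch-size mismatch.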
0%| | 0/17165 [00:00<?, ?it/s][2024-01-03 14:05:36,090] [INFO] [] Killing subprocess 2726715
[2024-01-03 14:05:36,090] [INFO] [] Killing subprocess 2726716
[2024-01-03 14:05:36,302] [INFO] [] Killing subprocess 2726717
[2024-01-03 14:05:36,598] [INFO] [] Killing subprocess 2726718
[2024-01-03 14:05:36,716] [ERROR] [] ['/home/bingxing2/home/scx9042/.conda/envs/myLLM/bin/python', '-u', 'src/', '--local_rank=3', '--deepspeed', 'ds_config.json', '--stage', 'sft', '--model_name_or_path', 'ChatGLM3-6B', '--do_train', '--dataset', 'allSemiDataNewMerge', '--template', 'chatglm3', '--finetuning_type', 'lora', '--lora_rank', '32', '--lora_target', 'all', '--cutoff_len', '1024', '--output_dir', 'outputs_ChatGLM3_lora_nopt_sft_haha', '--overwrite_cache', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--logging_steps', '10', '--save_steps', '50', '--save_total_limit', '3', '--learning_rate', '5e-5', '--num_train_epochs', '1', '--plot_loss', '--seed', '42', '--bf16', '--ddp_timeout', '30000', '--report_to', 'tensorboard', '--ddp_find_unused_parameters', 'False', '--gradient_checkpointing', 'True'] exits with return code = -7
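A note on reading the return code: a negative value usually means the worker process was killed by a signal rather than exiting on its own. The sketch below decodes it, assuming the launcher reports the value using Python's subprocess convention (return code -N means "terminated by signal N"). The function name is hypothetical.

```python
import signal


def describe_return_code(return_code: int) -> str:
    """Hypothetical helper: explain a subprocess-style return code.

    Assumes the Python subprocess convention, where a negative value -N
    means the child was terminated by signal N.
    """
    if return_code >= 0:
        return f"exited normally with status {return_code}"
    signum = -return_code
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Signal number not defined on this platform.
        name = f"unknown signal {signum}"
    return f"killed by {name}"


# Signal numbers are platform dependent; on Linux, signal 7 is SIGBUS.
print(describe_return_code(-7))
```

On Linux this reports SIGBUS (bus error), so the crash happens at the native level (memory access) rather than as a Python exception, which would also explain why no traceback appears in the log.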
Expected behavior
No response
System Info
No response
No response