chatglm3-6b: fine-tuning with DeepSpeed exits with return code = -7, but single-GPU fine-tuning works fine. The workaround from issue #1683 did not help either #2061

Closed
wumin86 opened this issue Jan 3, 2024 · 4 comments
Labels
solved This problem has been already solved

Comments

wumin86 commented Jan 3, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

The training command is below. Following the workaround from issue #1683, I also added export NCCL_P2P_LEVEL=NVL, but it did not help.
#!/bin/bash
module load anaconda cudnn/8.6.0.163_cuda11.x compilers/cuda/11.8 compilers/gcc/11.3.0
source activate myLLM
export PYTHONUNBUFFERED=1
export NCCL_P2P_LEVEL=NVL
deepspeed --num_gpus 4 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json \
    --stage sft \
    --model_name_or_path ChatGLM3-6B \
    --do_train \
    --dataset allSemiDataNewMerge \
    --template chatglm3 \
    --finetuning_type lora \
    --lora_rank 32 \
    --lora_target all \
    --cutoff_len 1024 \
    --output_dir outputs_ChatGLM3_lora_nopt_sft \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 50 \
    --save_total_limit 3 \
    --learning_rate 5e-5 \
    --num_train_epochs 1 \
    --plot_loss \
    --seed 42 \
    --bf16 \
    --ddp_timeout 30000 \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True
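As an aside, one way to check whether the NCCL_P2P_LEVEL setting is actually being honored (a sketch, assuming the failure is in NCCL's P2P path at all) is to rerun with NCCL's own debug output enabled; NCCL_DEBUG and NCCL_P2P_DISABLE are standard NCCL environment variables:

export NCCL_DEBUG=INFO      # makes NCCL print its transport choices (P2P, SHM, net) at init
export NCCL_P2P_DISABLE=1   # blunter than NCCL_P2P_LEVEL=NVL: disables GPU P2P entirely

If training still dies with -7 even with P2P fully disabled, the P2P workaround from issue #1683 was never going to help, and the failure is likely elsewhere (see the note on SIGBUS after the log below).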

The ds_config.json file is as follows:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
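Note that every "auto" above is resolved by the HF Trainer from the matching command-line flags at launch time; the resolved values appear in the config dump later in the log (train_micro_batch_size_per_gpu = 1, train_batch_size = 4 for 4 GPUs, plus an injected bf16 block because --bf16 was passed). A quick parse check before a long run, assuming a Python with the standard library on PATH:

python -m json.tool ds_config.json   # fails loudly if the JSON is malformed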

The log is as follows:

Running tokenizer on dataset: 100%|██████████| 68658/68658 [04:18<00:00, 203.98 examples/s]
Running tokenizer on dataset: 100%|██████████| 68658/68658 [04:19<00:00, 201.93 examples/s]
Running tokenizer on dataset: 100%|██████████| 68658/68658 [04:19<00:00, 264.99 examples/s]
/home/bingxing2/home/scx9042/.conda/envs/myLLM/lib/python3.10/site-packages/transformers/training_args.py:1751: FutureWarning: --push_to_hub_token is deprecated and will be removed in version 5 of 🤗 Transformers. Use --hub_token instead.
warnings.warn(

Running tokenizer on dataset: 100%|██████████| 68658/68658 [04:19<00:00, 264.85 examples/s]
/home/bingxing2/home/scx9042/.conda/envs/myLLM/lib/python3.10/site-packages/transformers/training_args.py:1751: FutureWarning: --push_to_hub_token is deprecated and will be removed in version 5 of 🤗 Transformers. Use --hub_token instead.
warnings.warn(
[WARNING|modeling_utils.py:2045] 2024-01-03 14:05:27,621 >> You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.
[WARNING|modeling_utils.py:2045] 2024-01-03 14:05:27,870 >> You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.
[2024-01-03 14:05:33,233] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-01-03 14:05:33,239] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-01-03 14:05:33,239] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-01-03 14:05:33,263] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-01-03 14:05:33,263] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-01-03 14:05:33,264] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-01-03 14:05:33,264] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500000000
[2024-01-03 14:05:33,264] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500000000
[2024-01-03 14:05:33,264] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-01-03 14:05:33,264] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-01-03 14:05:35,314] [INFO] [utils.py:791:see_memory_usage] Before initializing optimizer states
[2024-01-03 14:05:35,315] [INFO] [utils.py:792:see_memory_usage] MA 11.82 GB Max_MA 11.85 GB CA 11.87 GB Max_CA 12 GB
[2024-01-03 14:05:35,315] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 76.4 GB, percent = 30.1%
[2024-01-03 14:05:35,519] [INFO] [utils.py:791:see_memory_usage] After initializing optimizer states
[2024-01-03 14:05:35,520] [INFO] [utils.py:792:see_memory_usage] MA 11.93 GB Max_MA 12.1 GB CA 12.15 GB Max_CA 12 GB
[2024-01-03 14:05:35,521] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 76.52 GB, percent = 30.1%
[2024-01-03 14:05:35,521] [INFO] [stage_1_and_2.py:516:__init__] optimizer state initialized
[2024-01-03 14:05:35,706] [INFO] [utils.py:791:see_memory_usage] After initializing ZeRO optimizer
[2024-01-03 14:05:35,707] [INFO] [utils.py:792:see_memory_usage] MA 11.93 GB Max_MA 11.93 GB CA 12.15 GB Max_CA 12 GB
[2024-01-03 14:05:35,707] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 76.51 GB, percent = 30.1%
[2024-01-03 14:05:35,710] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2024-01-03 14:05:35,711] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-01-03 14:05:35,711] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-01-03 14:05:35,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05], mom=[(0.9, 0.999)]
[2024-01-03 14:05:35,714] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
[2024-01-03 14:05:35,715] [INFO] [config.py:988:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-01-03 14:05:35,715] [INFO] [config.py:988:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-01-03 14:05:35,715] [INFO] [config.py:988:print] amp_enabled .................. False
[2024-01-03 14:05:35,715] [INFO] [config.py:988:print] amp_params ................... False
[2024-01-03 14:05:35,715] [INFO] [config.py:988:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-01-03 14:05:35,715] [INFO] [config.py:988:print] bfloat16_enabled ............. True
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] checkpoint_parallel_write_pipeline False
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] checkpoint_tag_validation_enabled True
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] checkpoint_tag_validation_fail False
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x40047d4088e0>
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] communication_data_type ...... None
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] curriculum_enabled_legacy .... False
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] curriculum_params_legacy ..... False
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] data_efficiency_enabled ...... False
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] dataloader_drop_last ......... False
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] disable_allgather ............ False
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] dump_state ................... False
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] dynamic_loss_scale_args ...... None
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] eigenvalue_enabled ........... False
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] eigenvalue_gas_boundary_resolution 1
[2024-01-03 14:05:35,716] [INFO] [config.py:988:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] eigenvalue_layer_num ......... 0
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] eigenvalue_max_iter .......... 100
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] eigenvalue_stability ......... 1e-06
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] eigenvalue_tol ............... 0.01
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] eigenvalue_verbose ........... False
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] elasticity_enabled ........... False
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] fp16_auto_cast ............... None
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] fp16_enabled ................. False
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] fp16_master_weights_and_gradients False
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] global_rank .................. 0
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] grad_accum_dtype ............. None
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] gradient_accumulation_steps .. 1
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] gradient_clipping ............ 1.0
[2024-01-03 14:05:35,717] [INFO] [config.py:988:print] gradient_predivide_factor .... 1.0
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] graph_harvesting ............. False
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] initial_dynamic_scale ........ 1
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] load_universal_checkpoint .... False
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] loss_scale ................... 1.0
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] memory_breakdown ............. False
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] mics_hierarchial_params_gather False
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] mics_shard_size .............. -1
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] optimizer_legacy_fusion ...... False
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] optimizer_name ............... None
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] optimizer_params ............. None
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] pld_enabled .................. False
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] pld_params ................... False
[2024-01-03 14:05:35,718] [INFO] [config.py:988:print] prescale_gradients ........... False
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] scheduler_name ............... None
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] scheduler_params ............. None
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] seq_parallel_communication_data_type torch.float32
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] sparse_attention ............. None
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] sparse_gradients_enabled ..... False
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] steps_per_print .............. inf
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] train_batch_size ............. 4
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] train_micro_batch_size_per_gpu 1
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] use_data_before_expert_parallel_ False
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] use_node_local_storage ....... False
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] wall_clock_breakdown ......... False
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] weight_quantization_config ... None
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] world_size ................... 4
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] zero_allow_untested_optimizer True
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] zero_enabled ................. True
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] zero_force_ds_cpu_optimizer .. True
[2024-01-03 14:05:35,719] [INFO] [config.py:988:print] zero_optimization_stage ...... 2
[2024-01-03 14:05:35,720] [INFO] [config.py:974:print_user_config] json = {
"train_batch_size": 4,
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"gradient_clipping": 1.0,
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": false,
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"overlap_comm": false,
"contiguous_gradients": true
},
"steps_per_print": inf,
"bf16": {
"enabled": true
}
}
[INFO|trainer.py:1706] 2024-01-03 14:05:35,720 >> ***** Running training *****
[INFO|trainer.py:1707] 2024-01-03 14:05:35,720 >> Num examples = 68,658
[INFO|trainer.py:1708] 2024-01-03 14:05:35,720 >> Num Epochs = 1
[INFO|trainer.py:1709] 2024-01-03 14:05:35,720 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1712] 2024-01-03 14:05:35,720 >> Total train batch size (w. parallel, distributed & accumulation) = 4
[INFO|trainer.py:1713] 2024-01-03 14:05:35,720 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1714] 2024-01-03 14:05:35,720 >> Total optimization steps = 17,165
[INFO|trainer.py:1715] 2024-01-03 14:05:35,724 >> Number of trainable parameters = 59,293,696

0%| | 0/17165 [00:00<?, ?it/s][2024-01-03 14:05:36,090] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2726715
[2024-01-03 14:05:36,090] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2726716
[2024-01-03 14:05:36,302] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2726717
[2024-01-03 14:05:36,598] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2726718
[2024-01-03 14:05:36,716] [ERROR] [launch.py:321:sigkill_handler] ['/home/bingxing2/home/scx9042/.conda/envs/myLLM/bin/python', '-u', 'src/train_bash.py', '--local_rank=3', '--deepspeed', 'ds_config.json', '--stage', 'sft', '--model_name_or_path', 'ChatGLM3-6B', '--do_train', '--dataset', 'allSemiDataNewMerge', '--template', 'chatglm3', '--finetuning_type', 'lora', '--lora_rank', '32', '--lora_target', 'all', '--cutoff_len', '1024', '--output_dir', 'outputs_ChatGLM3_lora_nopt_sft_haha', '--overwrite_cache', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--logging_steps', '10', '--save_steps', '50', '--save_total_limit', '3', '--learning_rate', '5e-5', '--num_train_epochs', '1', '--plot_loss', '--seed', '42', '--bf16', '--ddp_timeout', '30000', '--report_to', 'tensorboard', '--ddp_find_unused_parameters', 'False', '--gradient_checkpointing', 'True'] exits with return code = -7
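For reference: a negative return code from the DeepSpeed launcher is the POSIX signal that killed the worker, so -7 means the process died from SIGBUS (signal 7). On Linux, SIGBUS right at the first training step usually indicates a failed memory-mapped access, and a filled-up /dev/shm is the usual suspect in multi-process training. A minimal check, assuming a standard Linux shell:

kill -l 7         # prints BUS, confirming signal 7 is SIGBUS
df -h /dev/shm    # how much shared memory is available to the job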

Expected behavior

No response

System Info

No response

Others

No response

@hiyouga hiyouga added the pending This problem is yet to be addressed label Jan 3, 2024
hiyouga (Owner) commented Jan 9, 2024

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jan 9, 2024
@hiyouga hiyouga closed this as completed Jan 9, 2024
wumin86 (Author) commented Jan 10, 2024

microsoft/DeepSpeed#4002

The solutions there involve increasing Docker's shared memory, but I am running on a supercomputing cloud without Docker. How do I solve this?

hiyouga (Owner) commented Jan 10, 2024

Request more memory for your job.
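For example, if the cluster schedules jobs with Slurm (a guess based on the module load lines in the script; adjust to your site's scheduler), the host-memory request could be raised in the job header. The values below are placeholders:

#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --mem=256G    # hypothetical value: raise the per-node host-memory request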

@luolanfeixue

I ran into the same problem. Docker shared memory still has 600+ GB free, the machine has 920 GB of CPU RAM, and CPU memory usage never spikes, so it does not look like a Docker shared-memory or CPU-memory issue. How can this be solved?
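When the usual suspects (Docker shm, total RAM) look healthy, two cheap next steps are to ask the kernel what actually delivered the SIGBUS and to take the DataLoader's shared-memory traffic out of the picture. A sketch, assuming a Linux host and that your LLaMA-Factory version exposes these worker flags:

dmesg -T | grep -iE 'bus error|out of memory|killed process' | tail
# then rerun with multiprocess data loading disabled, e.g. add to the training command:
#   --preprocessing_num_workers 1 --dataloader_num_workers 0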
