[BUG] RuntimeError: still have inflight params in KTO #1732

Closed

iszengxin opened this issue Jun 13, 2024 · 2 comments
iszengxin commented Jun 13, 2024

Hello, I use LLaMA-Factory (https://github.com/hiyouga/LLaMA-Factory) to run a KTO task.
The package versions are as follows:

transformers    4.41.2
trl             0.9.4
deepspeed       0.13.0

I use DeepSpeed ZeRO-3, and the DeepSpeed config is as follows:


{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": true,
    "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
    },
    "bf16": {
      "enabled": "auto"
    },
    "optimizer": {
      "type": "AdamW",
      "params": {
          "lr": "auto",
          "betas": "auto",
          "eps": "auto",
          "weight_decay": "auto"
      }
    },
    "zero_optimization": {
      "stage": 3,
      "overlap_comm": true,
      "contiguous_gradients": true,
      "sub_group_size": 1e9,
      "reduce_bucket_size": "auto",
      "stage3_prefetch_bucket_size": "auto",
      "stage3_param_persistence_threshold": "auto",
      "stage3_max_live_parameters": 1e9,
      "stage3_max_reuse_distance": 1e9,
      "stage3_gather_16bit_weights_on_model_save": true
    }
  }
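
For reference, a config like this is typically wired into training through the Hugging Face TrainingArguments deepspeed argument; the "auto" entries in the JSON above are then resolved from those arguments at runtime. A minimal sketch (the file name and values below are placeholders, not my exact setup):

    from transformers import TrainingArguments

    # Placeholder values for illustration only; the "auto" entries in the
    # DeepSpeed JSON above are filled in from these TrainingArguments.
    training_args = TrainingArguments(
        output_dir="outputs/kto",            # hypothetical output directory
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        bf16=True,                           # resolves the "auto" bf16 entry
        deepspeed="ds_z3_config.json",       # path to the ZeRO-3 JSON shown above
    )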

I run the default KTO setup in LLaMA-Factory, and I get this error during evaluation within training:

Traceback (most recent call last):
  File "src/train.py", line 14, in <module>
    main()
  File "src/train.py", line 5, in main
    run_exp()
  File "/maindata/data/user/xin.zeng/workspace/code_git/kto/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 41, in run_exp
    run_kto(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/maindata/data/user/xin.zeng/workspace/code_git/kto/LLaMA-Factory-main/src/llamafactory/train/kto/workflow.py", line 59, in run_kto
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2291, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2721, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 3572, in evaluate
    output = eval_loop(
  File "/usr/local/lib/python3.8/dist-packages/trl/trainer/kto_trainer.py", line 1514, in evaluation_loop
    initial_output = super().evaluation_loop(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 3757, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/usr/local/lib/python3.8/dist-packages/trl/trainer/kto_trainer.py", line 1444, in prediction_step
    loss, metrics = self.get_batch_loss_metrics(model, inputs)
  File "/maindata/data/user/xin.zeng/workspace/code_git/kto/LLaMA-Factory-main/src/llamafactory/train/kto/trainer.py", line 173, in get_batch_loss_metrics
    reference_chosen_logps, reference_rejected_logps, reference_kl_logps = self.compute_reference_log_probs(
  File "/maindata/data/user/xin.zeng/workspace/code_git/kto/LLaMA-Factory-main/src/llamafactory/train/kto/trainer.py", line 155, in compute_reference_log_probs
    reference_chosen_logps, reference_rejected_logps, reference_kl_logps, _ = self.concatenated_forward(
  File "/maindata/data/user/xin.zeng/workspace/code_git/kto/LLaMA-Factory-main/src/llamafactory/train/kto/trainer.py", line 129, in concatenated_forward
    target_logps, target_logps_avg = self.forward(model, batch)
  File "/maindata/data/user/xin.zeng/workspace/code_git/kto/LLaMA-Factory-main/src/llamafactory/train/kto/trainer.py", line 121, in forward
    logits = model(**model_inputs, return_dict=True, use_cache=False).logits.to(torch.float32)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1581, in _call_impl
    hook_result = hook(self, args, result)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 232, in _end_of_forward_hook
    self.get_param_coordinator(training=False).reset_step()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 208, in reset_step
    raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [{'id': 35, 'status': 'AVAILABLE', 'numel': 65536, 'ds_numel': 65536, 'shape': (4096, 16), 'ds_shape': (4096, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}]

The same setup works fine for the DPO task. How did this problem arise, and how can it be fixed?
Thanks very much.

Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

qgallouedec reopened this Aug 22, 2024

qgallouedec (Member) commented:
Hi, thanks for sharing this error.
Maybe related deepspeedai/DeepSpeed#3156?
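If it is indeed the same DeepSpeed issue, one workaround that is sometimes suggested for "still have inflight params" under ZeRO-3 is to disable parameter prefetching. This is an untested sketch, not a confirmed fix for this report; it reuses the hypothetical TrainingArguments from the earlier sketch and passes the config as an in-memory dict (which the deepspeed argument also accepts):

    from transformers import TrainingArguments

    # Untested workaround sketch: same ZeRO-3 config as above, but with
    # stage3_prefetch_bucket_size set to 0 to disable parameter prefetching.
    ds_config = {
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": 0,            # was "auto"; 0 disables prefetch
            "stage3_param_persistence_threshold": "auto",
            "stage3_max_live_parameters": 1e9,
            "stage3_max_reuse_distance": 1e9,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
    }

    training_args = TrainingArguments(output_dir="outputs/kto", deepspeed=ds_config)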
Unfortunately, we don't maintain https://github.com/hiyouga/LLaMA-Factory here; you will probably get a better answer if you ask your question there directly.
If the problem is related to trl, we need the exact minimal steps required to reproduce it (code, data, model, etc.).
I'm closing this issue. If the problem persists, please open a new one following the recommendations above.
