[BUG] RuntimeError: still have inflight params in KTO #1732

Closed

iszengxin opened this issue Jun 13, 2024 · 2 comments
iszengxin commented Jun 13, 2024

Hello, I use LLaMA-Factory (https://github.com/hiyouga/LLaMA-Factory) to run a KTO task.
The package versions are as follows:

transformers    4.41.2
trl             0.9.4
deepspeed       0.13.0

I use DeepSpeed ZeRO-3, and the DeepSpeed config is as follows:


{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": true,
    "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
    },
    "bf16": {
      "enabled": "auto"
    },
    "optimizer": {
      "type": "AdamW",
      "params": {
          "lr": "auto",
          "betas": "auto",
          "eps": "auto",
          "weight_decay": "auto"
      }
    },
    "zero_optimization": {
      "stage": 3,
      "overlap_comm": true,
      "contiguous_gradients": true,
      "sub_group_size": 1e9,
      "reduce_bucket_size": "auto",
      "stage3_prefetch_bucket_size": "auto",
      "stage3_param_persistence_threshold": "auto",
      "stage3_max_live_parameters": 1e9,
      "stage3_max_reuse_distance": 1e9,
      "stage3_gather_16bit_weights_on_model_save": true
    }
  }
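
For reference, a config like this is typically wired into training through the Hugging Face TrainingArguments deepspeed argument; the "auto" entries in the JSON above are then resolved from those arguments at runtime. A minimal sketch (the file name and values below are placeholders, not my exact setup):

    from transformers import TrainingArguments

    # Placeholder values for illustration only; the "auto" entries in the
    # DeepSpeed JSON above are filled in from these TrainingArguments.
    training_args = TrainingArguments(
        output_dir="outputs/kto",            # hypothetical output directory
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        bf16=True,                           # resolves the "auto" bf16 entry
        deepspeed="ds_z3_config.json",       # path to the ZeRO-3 JSON shown above
    )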

I run the default KTO setup in LLaMA-Factory, and I get this error during evaluation within training:

Traceback (most recent call last):
  File "src/train.py", line 14, in <module>
    main()
  File "src/train.py", line 5, in main
    run_exp()
  File "/maindata/data/user/xin.zeng/workspace/code_git/kto/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 41, in run_exp
    run_kto(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/maindata/data/user/xin.zeng/workspace/code_git/kto/LLaMA-Factory-main/src/llamafactory/train/kto/workflow.py", line 59, in run_kto
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2291, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2721, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 3572, in evaluate
    output = eval_loop(
  File "/usr/local/lib/python3.8/dist-packages/trl/trainer/kto_trainer.py", line 1514, in evaluation_loop
    initial_output = super().evaluation_loop(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 3757, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/usr/local/lib/python3.8/dist-packages/trl/trainer/kto_trainer.py", line 1444, in prediction_step
    loss, metrics = self.get_batch_loss_metrics(model, inputs)
  File "/maindata/data/user/xin.zeng/workspace/code_git/kto/LLaMA-Factory-main/src/llamafactory/train/kto/trainer.py", line 173, in get_batch_loss_metrics
    reference_chosen_logps, reference_rejected_logps, reference_kl_logps = self.compute_reference_log_probs(
  File "/maindata/data/user/xin.zeng/workspace/code_git/kto/LLaMA-Factory-main/src/llamafactory/train/kto/trainer.py", line 155, in compute_reference_log_probs
    reference_chosen_logps, reference_rejected_logps, reference_kl_logps, _ = self.concatenated_forward(
  File "/maindata/data/user/xin.zeng/workspace/code_git/kto/LLaMA-Factory-main/src/llamafactory/train/kto/trainer.py", line 129, in concatenated_forward
    target_logps, target_logps_avg = self.forward(model, batch)
  File "/maindata/data/user/xin.zeng/workspace/code_git/kto/LLaMA-Factory-main/src/llamafactory/train/kto/trainer.py", line 121, in forward
    logits = model(**model_inputs, return_dict=True, use_cache=False).logits.to(torch.float32)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1581, in _call_impl
    hook_result = hook(self, args, result)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 232, in _end_of_forward_hook
    self.get_param_coordinator(training=False).reset_step()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 208, in reset_step
    raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [{'id': 35, 'status': 'AVAILABLE', 'numel': 65536, 'ds_numel': 65536, 'shape': (4096, 16), 'ds_shape': (4096, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}]

The same setup works fine for the DPO task. How did this problem arise, and how can it be fixed?
Thanks very much.

Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

qgallouedec reopened this Aug 22, 2024

qgallouedec (Member) commented:
Hi, thanks for sharing this error.
Maybe related deepspeedai/DeepSpeed#3156?
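If it is indeed the same DeepSpeed issue, one workaround that is sometimes suggested for "still have inflight params" under ZeRO-3 is to disable parameter prefetching. This is an untested sketch, not a confirmed fix for this report; it reuses the hypothetical TrainingArguments from the earlier sketch and passes the config as an in-memory dict (which the deepspeed argument also accepts):

    from transformers import TrainingArguments

    # Untested workaround sketch: same ZeRO-3 config as above, but with
    # stage3_prefetch_bucket_size set to 0 to disable parameter prefetching.
    ds_config = {
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": 0,            # was "auto"; 0 disables prefetch
            "stage3_param_persistence_threshold": "auto",
            "stage3_max_live_parameters": 1e9,
            "stage3_max_reuse_distance": 1e9,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
    }

    training_args = TrainingArguments(output_dir="outputs/kto", deepspeed=ds_config)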
Unfortunately, we don't maintain https://github.com/hiyouga/LLaMA-Factory here; you will probably get a better answer if you ask your question there directly.
If the problem is related to trl, we need the exact minimal steps required to reproduce it (code, data, model, etc.).
I'm closing this issue. If the problem persists, please open a new one following the recommendations above.
