Mlflow duplicate logging #2063

Closed
6 of 8 tasks
jsh2581 opened this issue Nov 15, 2024 · 3 comments · Fixed by #2109
Labels
bug Something isn't working

Comments


jsh2581 commented Nov 15, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

One log entry per step.

Current behaviour

Log entries are duplicated within a single step.

image

Steps to reproduce

  1. Pull the Docker image winglian/axolotl:main-20241030-py3.11-cu124-2.4.1.
  2. Set up MLflow (ghcr.io/mlflow/mlflow:v2.17.2).
  3. Start the axolotl Docker container.
  4. Prepare the dataset, base model, and training config file.
  5. Run accelerate launch -m axolotl.cli.train my_config.yml
  6. Go to the MLflow logging directory.
  7. Check the log file.
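
The last two steps can be automated with a small script. The following is a minimal sketch, assuming the MLflow server writes to the local file store, where each metric file holds one "timestamp value step" line per logged value. The function names and the example path are hypothetical and are not part of axolotl or MLflow.

```python
from collections import Counter
from pathlib import Path


def count_steps(metric_file):
    """Return a Counter mapping step -> number of values logged for that step.

    Assumption: each line is "timestamp value step", separated by whitespace,
    as written by MLflow's local file store.
    """
    counts = Counter()
    for line in Path(metric_file).read_text().splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank lines or lines without a step column
        counts[int(fields[2])] += 1
    return counts


def duplicated_steps(metric_file):
    """Return the sorted list of steps that were logged more than once."""
    return sorted(step for step, n in count_steps(metric_file).items() if n > 1)
```

For example, duplicated_steps("mlruns/<experiment_id>/<run_id>/metrics/loss") would return an empty list for a healthy run and the affected step numbers otherwise. The path layout shown here is an assumption about the file store and may differ in other setups.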

Config yaml

base_model: meta-llama/Llama-3.2-3B
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: false

strict: false
chat_template:
output_dir: /workspace/axolotl/3_model/pretraining
skip_prepare_dataset: true
datasets:
  - path: /workspace/axolotl/2_data/dataset-tokenized-8k/train
    split: train
    type:

sequence_len: 8192
sample_packing: false
pad_to_sequence_len: false

# mlflow configuration if you're using it
mlflow_tracking_uri: http://mlflow-server:5000
mlflow_experiment_name: llama-3B
mlflow_run_name: llama-3B

gradient_accumulation_steps: 1
micro_batch_size: 2
# num_epochs: 1
# max_steps: 200000
optimizer: adamw_torch
lr_scheduler: cosine
lr_scheduler_kwargs:
cosine_min_lr_ratio: 1e-3

learning_rate: 1e-5

train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
#flash_attention: true

warmup_steps: 20000
# evals_per_epoch: 2
eval_table_size:

save_steps: 40000
debug:
deepspeed:
weight_decay: 0.0
fsdp:
#   - full_shard
#   - auto_wrap
# fsdp_config:
#   fsdp_limit_all_gathers: true
#   fsdp_sync_module_states: true
#   fsdp_offload_params: false
#   fsdp_use_orig_params: false
#   fsdp_cpu_ram_efficient_loading: true
#   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
#   fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
#   fsdp_state_dict_type: FULL_STATE_DICT
#   fsdp_sharding_strategy: FULL_SHARD
#   fsdp_backward_prefetch: BACKWARD_PRE
special_tokens:
  pad_token: <|end_of_text|>

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

main/8c3a727f9d60ffd3af385f90bcc3fa3a56398fe1

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
jsh2581 added the bug label Nov 15, 2024
@NanoCode012
Collaborator

cc @awhazell, have you seen any duplicate logging to MLflow recently?

@awhazell
Contributor

The MLflowCallback may be getting registered twice: it's added explicitly here, but I think it is also added through the report_to argument to the trainer. The fix may just be to delete the explicit addition of that callback in the linked code.

I can't see any duplicate logs in my setup, though. I'm using Postgres as the backend store, which might deal with duplicates somehow.
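
To illustrate the suspected mechanism, here is a minimal, library-free sketch. LoggingCallback, ToyTrainer, and dedupe_by_type are hypothetical stand-ins, not axolotl or transformers classes. The sketch only shows that registering the same callback class twice produces two records per step, and that a single registration produces one.

```python
class LoggingCallback:
    """Hypothetical stand-in for a tracking callback such as MLflowCallback."""

    def __init__(self, sink):
        self.sink = sink  # shared list playing the role of the tracking backend

    def on_log(self, step, logs):
        self.sink.append((step, dict(logs)))


class ToyTrainer:
    """Hypothetical trainer: every registered callback runs on every log event."""

    def __init__(self, callbacks):
        self.callbacks = list(callbacks)

    def log(self, step, logs):
        for cb in self.callbacks:
            cb.on_log(step, logs)


def dedupe_by_type(callbacks):
    """Keep only the first callback of each class, preserving order."""
    seen = set()
    unique = []
    for cb in callbacks:
        if type(cb) not in seen:
            seen.add(type(cb))
            unique.append(cb)
    return unique


sink = []
# Same class registered twice: once via the reporting integration,
# once explicitly (the situation suspected in this issue).
callbacks = [LoggingCallback(sink), LoggingCallback(sink)]
ToyTrainer(callbacks).log(step=1, logs={"loss": 2.5})
duplicated = len(sink)  # two records for a single step

sink.clear()
ToyTrainer(dedupe_by_type(callbacks)).log(step=1, logs={"loss": 2.5})
deduplicated = len(sink)  # one record for the same step
```

The deduplication helper is only for illustration. The fix suggested above is simpler: drop the explicit registration so that the callback is added once.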

@NanoCode012
Collaborator

@awhazell, thanks for catching that.

@jsh2581, would you be able to try out the linked PR to see if it solves the issue for you?
