Tensorboard Logger is flushed on every step #20551

Open
leoleoasd opened this issue Jan 17, 2025 · 1 comment
Labels
bug Something isn't working needs triage Waiting to be triaged by maintainers ver: 2.4.x


leoleoasd commented Jan 17, 2025

Bug description

I noticed significantly degraded performance when using the TensorBoard logger with a log directory on S3.
I printed the call stack of the TensorBoard writer's flush call and found that every call to log_metrics triggers a flush, regardless of the max_queue and flush_secs settings.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

    import lightning as L
    from lightning.pytorch.loggers import TensorBoardLogger

    # num_nodes and local_world_size are defined elsewhere in the training script
    logger = TensorBoardLogger("s3-mountpoint", max_queue=1000, flush_secs=20)
    trainer = L.Trainer(
        num_nodes=num_nodes,
        devices=local_world_size,
        accelerator="cuda",
        max_epochs=1,
        precision="bf16-true",
        strategy="fsdp",
        log_every_n_steps=1,
        enable_checkpointing=False,
        default_root_dir="mountpoint",
        logger=logger,
    )

Error messages and logs

    trainer.fit(lit_model, data)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 278, in advance
    trainer._logger_connector.update_train_step_metrics()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 163, in update_train_step_metrics
    self.log_metrics(self.metrics["log"])
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 118, in log_metrics
    logger.save()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loggers/tensorboard.py", line 210, in save
    super().save()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/fabric/loggers/tensorboard.py", line 290, in save
    self.experiment.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/torch/utils/tensorboard/writer.py", line 1194, in flush
    writer.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/torch/utils/tensorboard/writer.py", line 153, in flush
    self.event_writer.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/tensorboard/summary/writer/event_file_writer.py", line 127, in flush
    self._async_writer.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/tensorboard/summary/writer/event_file_writer.py", line 185, in flush
    traceback.print_stack()

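A stack like the one above can also be captured without editing files in site-packages, by wrapping the method at runtime. The following is a minimal sketch of that idea, not the method used in this report: `trace_calls` and `FakeWriter` are hypothetical names, and `FakeWriter` stands in for the writer class in `tensorboard/summary/writer/event_file_writer.py` so the example runs on its own.

```python
import functools
import traceback


def trace_calls(cls, method_name, limit=8):
    """Wrap cls.method_name so each call is counted and its call stack is printed."""
    original = getattr(cls, method_name)
    counter = {"calls": 0}

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        counter["calls"] += 1
        # Print the innermost `limit` frames to stderr, showing who triggered the call.
        traceback.print_stack(limit=limit)
        return original(self, *args, **kwargs)

    setattr(cls, method_name, wrapper)
    return counter


class FakeWriter:
    """Stand-in for the real event writer (hypothetical, for illustration only)."""

    def flush(self):
        return "flushed"


counter = trace_calls(FakeWriter, "flush")
writer = FakeWriter()
for _ in range(3):
    writer.flush()  # each call prints a stack and increments the counter
```

Comparing `counter["calls"]` with the number of training steps shows whether a flush happens on every step.
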
Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0): 2.4.0
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux): Linux
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
@leoleoasd leoleoasd added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Jan 17, 2025
leoleoasd commented

I commented out this line


and performance returned to normal. The events file is still written frequently, but by TensorBoard's async writer, and performance is not affected.
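
As a possible workaround that avoids patching the installed package, the `save()` call seen in the stack trace could be throttled in a subclass. This is only a sketch under assumptions: `ThrottledSaveMixin`, `min_save_interval`, and `FakeLogger` are hypothetical names, and the example uses a stand-in logger so it runs without Lightning installed. Combining the mixin with the real `TensorBoardLogger` is untested here.

```python
import time


class ThrottledSaveMixin:
    """Forward save() to the parent logger at most once per `min_save_interval` seconds."""

    min_save_interval = 20.0  # seconds; hypothetical default, chosen to match flush_secs=20

    def save(self):
        now = time.monotonic()
        last = getattr(self, "_last_save_time", None)
        if last is not None and now - last < self.min_save_interval:
            return  # too soon since the last forwarded save: skip the explicit flush
        self._last_save_time = now
        super().save()


class FakeLogger:
    """Stand-in for a logger whose save() flushes the writer (illustration only)."""

    def __init__(self):
        self.flushes = 0

    def save(self):
        self.flushes += 1


class ThrottledFakeLogger(ThrottledSaveMixin, FakeLogger):
    pass


# Assumed real-world usage (untested):
# class ThrottledTensorBoardLogger(ThrottledSaveMixin, TensorBoardLogger): ...

logger = ThrottledFakeLogger()
for _ in range(1000):
    logger.save()  # only the first call reaches FakeLogger.save() within the interval
```

Note that anything else the parent `save()` does is skipped on throttled calls too, so this trades explicit flushes for reliance on the writer's own background flushing.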
