I noticed significantly degraded performance with the TensorBoard logger when logging to S3.
I printed the call stack at the TensorBoard logger's flush call and found that TensorBoard's flush is called on every call to log_metrics:
    trainer.fit(lit_model, data)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 278, in advance
    trainer._logger_connector.update_train_step_metrics()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 163, in update_train_step_metrics
    self.log_metrics(self.metrics["log"])
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 118, in log_metrics
    logger.save()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loggers/tensorboard.py", line 210, in save
    super().save()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/fabric/loggers/tensorboard.py", line 290, in save
    self.experiment.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/torch/utils/tensorboard/writer.py", line 1194, in flush
    writer.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/torch/utils/tensorboard/writer.py", line 153, in flush
    self.event_writer.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/tensorboard/summary/writer/event_file_writer.py", line 127, in flush
    self._async_writer.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/tensorboard/summary/writer/event_file_writer.py", line 185, in flush
    traceback.print_stack()
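For anyone who wants to collect a similar trace without editing files inside site-packages, here is a minimal sketch using only the standard library. It wraps a method so that each call prints the caller's stack before running the original. `DummyWriter`, `log_metrics`, and `trace_calls` are hypothetical stand-ins for illustration; they are not Lightning or TensorBoard APIs, and applying the same wrapper to a real writer class is an assumption I have not tested here.

```python
import functools
import io
import traceback


def trace_calls(cls, method_name, stream=None):
    """Wrap cls.method_name so every call prints the caller's stack first."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        # file=None makes print_stack write to sys.stderr.
        traceback.print_stack(file=stream)
        return original(self, *args, **kwargs)

    setattr(cls, method_name, wrapper)
    return original  # returned so the patch can be undone later


class DummyWriter:
    """Hypothetical stand-in for an event writer (not a real TensorBoard class)."""

    def __init__(self):
        self.flush_count = 0

    def flush(self):
        self.flush_count += 1


def log_metrics(writer):
    # Stand-in for a logging call that flushes on every invocation.
    writer.flush()


buffer = io.StringIO()
original_flush = trace_calls(DummyWriter, "flush", stream=buffer)

writer = DummyWriter()
for _ in range(3):
    log_metrics(writer)

DummyWriter.flush = original_flush  # restore the unpatched method
trace_text = buffer.getvalue()
```

Each of the three `log_metrics` calls leaves one stack dump in `trace_text`, which makes it easy to see which caller triggers the flush and how often.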
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0): 2.4.0
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux): Linux
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
and the performance is normal. The events file is still being written frequently, but by TensorBoard's async writer, and the performance is not affected.
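If removing the explicit flush entirely feels too aggressive, a middle ground would be to rate-limit it. Below is a minimal, standard-library-only sketch of that idea. `ThrottledFlush` is a hypothetical helper, not part of Lightning or TensorBoard, and the 30-second interval is an arbitrary assumption. The trace shows `save()` calling `self.experiment.flush()`, so an overridden `save()` in a logger subclass looks like one plausible place to use something like this, but that wiring is untested here.

```python
import time


class ThrottledFlush:
    """Call `flush_fn` at most once per `min_interval` seconds (hypothetical helper)."""

    def __init__(self, flush_fn, min_interval=30.0, clock=time.monotonic):
        self._flush_fn = flush_fn
        self._min_interval = min_interval
        self._clock = clock
        self._last_flush = None  # None means no flush has happened yet

    def __call__(self, force=False):
        now = self._clock()
        due = self._last_flush is None or now - self._last_flush >= self._min_interval
        if force or due:
            self._flush_fn()
            self._last_flush = now
            return True
        return False  # skipped: the previous flush was too recent


# Demonstration with a fake clock so the behaviour is deterministic.
fake_time = [0.0]
flushes = []
throttled = ThrottledFlush(
    lambda: flushes.append(fake_time[0]),
    min_interval=30.0,
    clock=lambda: fake_time[0],
)

for step in range(100):  # 100 logging steps, one simulated second apart
    fake_time[0] = float(step)
    throttled()

throttled(force=True)  # final forced flush, e.g. at the end of training
```

In this simulation the 100 logging steps trigger 4 flushes instead of 100, plus the forced one at the end. The trade-off is that events logged since the last flush may not be flushed yet if the process dies, unless the writer's own background flushing covers them.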
What version are you seeing the problem on?
v2.4