[bugfix] Resolve memory not logged when missing metrics #8174

Merged: 6 commits (Jun 28, 2021)
Changes from all commits
CHANGELOG.md (4 additions, 0 deletions)
@@ -317,6 +317,10 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

- Fixed a bug where an infinite recursion would be triggered when using the `BaseFinetuning` callback on a model that contains a `ModuleDict` ([#8170](https://github.com/PyTorchLightning/pytorch-lightning/pull/8170))


- Fixed `log_gpu_memory` metrics not being added to `logging` when nothing else is logged ([#8174](https://github.com/PyTorchLightning/pytorch-lightning/pull/8174))


## [1.3.7] - 2021-06-22

- Fixed a bug where skipping an optimizer while using amp causes amp to trigger an assertion error ([#7975](https://github.com/PyTorchLightning/pytorch-lightning/pull/7975))
@@ -38,6 +38,7 @@ def __init__(self, trainer: 'pl.Trainer', log_gpu_memory: Optional[str] = None)
         self._progress_bar_metrics: Dict[str, float] = {}
         self._logged_metrics: Dict[str, _METRIC] = {}
         self._callback_metrics: Dict[str, _METRIC] = {}
+        self._gpus_metrics: Dict[str, str] = {}
         self._epoch_end_reached = False
         self._current_fx: Optional[str] = None
         self._batch_idx: Optional[int] = None
@@ -94,11 +95,6 @@ def log_metrics(self, metrics: Dict[str, _METRIC], step: Optional[int] = None) -
         if self.trainer.logger is None or not metrics:
             return

-        # add gpu memory
-        if self.trainer._device_type == DeviceType.GPU and self.log_gpu_memory:
-            mem_map = memory.get_memory_profile(self.log_gpu_memory)
-            metrics.update(mem_map)
-
         # turn all tensors to scalars
         scalar_metrics = metrics_to_scalars(metrics)

@@ -213,6 +209,8 @@ def update_train_step_metrics(self) -> None:
         if self.trainer.fit_loop.should_accumulate() and self.trainer.lightning_module.automatic_optimization:
             return

+        self._log_gpus_metrics()
+
         # when metrics should be logged
         assert not self._epoch_end_reached
         if self.should_update_logs or self.trainer.fast_dev_run:
@@ -226,6 +224,12 @@ def update_train_epoch_metrics(self) -> None:
         # reset result collection for next epoch
         self.trainer._results.reset(metrics=True)

+    def _log_gpus_metrics(self):
+        for key, mem in self.gpus_metrics.items():
+            gpu_id = int(key.split('/')[0].split(':')[1])
+            if gpu_id in self.trainer.accelerator_connector.parallel_device_ids:
+                self.trainer.lightning_module.log(key, mem, prog_bar=False, logger=True, on_step=True, on_epoch=False)
+
Review comment (Contributor): since we're already in the trainer, why do we have to log through the LightningModule's `log`?


"""
Utilities and properties
"""
@@ -276,6 +280,13 @@ def metrics(self) -> Dict[MetricSource, Dict[str, _METRIC]]:
         on_step = not self._epoch_end_reached
         return self.trainer._results.metrics(on_step)

+    @property
+    def gpus_metrics(self) -> Dict[str, str]:
+        if self.trainer._device_type == DeviceType.GPU and self.log_gpu_memory:
+            mem_map = memory.get_memory_profile(self.log_gpu_memory)
+            self._gpus_metrics.update(mem_map)
+        return self._gpus_metrics
+
Review comment (Contributor) on lines +283 to +289:
Why did this PR need to add `_gpus_metrics`? It doesn't seem related to the linked issue.

This means that GPU metrics are now duplicated: once in this dictionary and once in the logged metrics.

Also, it only gets filled when `self.log_gpu_memory` is set, so it can't be used without the flag anyway.
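
To make the gating point concrete, here is a self-contained toy stand-in (not the real `LoggerConnector`; `memory.get_memory_profile` is replaced by a hard-coded dict) showing that the property stays empty unless the flag is set:

```python
from typing import Dict, Optional


class StubConnector:
    """Toy stand-in reproducing only the gating of the `gpus_metrics` property above."""

    def __init__(self, log_gpu_memory: Optional[str] = None, on_gpu: bool = True):
        self.log_gpu_memory = log_gpu_memory
        self.on_gpu = on_gpu  # stands in for `trainer._device_type == DeviceType.GPU`
        self._gpus_metrics: Dict[str, str] = {}

    @property
    def gpus_metrics(self) -> Dict[str, str]:
        if self.on_gpu and self.log_gpu_memory:
            # the real property calls memory.get_memory_profile(self.log_gpu_memory) here
            self._gpus_metrics.update({'gpu_id: 0/memory.used (MB)': '42'})
        return self._gpus_metrics


assert StubConnector(log_gpu_memory=None).gpus_metrics == {}  # flag unset: never filled
assert StubConnector(log_gpu_memory='all').gpus_metrics       # flag set: filled (and logged as well)
```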

Review comment (@thomasw21, Aug 20, 2021):
I think this broke the `log_gpu_memory="min_max"` option in `Trainer.fit`; it looks related to #9010.

Essentially, `memory.get_memory_profile` adds two keys, `min_gpu_mem` and `max_gpu_mem`, that do not follow the conventional key format (typically keys look like f"gpu_id: {gpu_id}/memory.used (MB)").

I see #9013 fixed it.
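
The failure is easy to reproduce in isolation: the key parsing in `_log_gpus_metrics` assumes every key carries the `gpu_id: <id>/...` prefix, which the `min_max` profile keys do not. A standalone sketch using only the key formats described above (no Lightning imports needed):

```python
# Keys produced by the 'all' profile parse cleanly; the 'min_max' keys do not.
all_keys = ['gpu_id: 0/memory.used (MB)', 'gpu_id: 1/memory.used (MB)']
min_max_keys = ['min_gpu_mem', 'max_gpu_mem']

for key in all_keys:
    print(key, '->', int(key.split('/')[0].split(':')[1]))  # prints 0, 1

for key in min_max_keys:
    try:
        int(key.split('/')[0].split(':')[1])
    except IndexError as err:  # no ':' to split on, so index 1 does not exist
        print(key, '-> IndexError:', err)
```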

     @property
     def callback_metrics(self) -> Dict[str, _METRIC]:
         if self.trainer._results:
tests/trainer/logging_/test_train_loop_logging.py (18 additions, 0 deletions)
@@ -712,3 +712,21 @@ def training_step(self, *args):
     model = TestModel()
     with pytest.raises(MisconfigurationException, match=r'reduce_fx={min,max,mean,sum}\)` are currently supported'):
         trainer.fit(model)
+
+
+@RunIf(min_gpus=2)
+def test_log_gpu_memory_without_logging_on_step(tmpdir):
+
+    model = BoringModel()
+    trainer = Trainer(
+        default_root_dir=tmpdir,
+        max_epochs=1,
+        limit_train_batches=1,
+        limit_val_batches=0,
+        log_gpu_memory='all',
+        log_every_n_steps=1,
+        gpus=[1]
+    )
+    trainer.fit(model)
+
+    assert 'gpu_id: 1/memory.used (MB)' in trainer.logged_metrics
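
For reference, a user-side sketch of the behaviour this test pins down: nothing is logged from `training_step`, yet per-GPU memory still shows up in `trainer.logged_metrics`. The model, dataset, and flag values below are illustrative; it assumes a CUDA machine and a pytorch-lightning version that still exposes the `log_gpu_memory` Trainer flag:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class PlainModel(pl.LightningModule):
    """A model that never calls self.log(...) anywhere."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).sum()  # dummy loss, and no logging at all

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


train_data = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=16)
trainer = pl.Trainer(gpus=1, max_epochs=1, log_gpu_memory='all', log_every_n_steps=1)
trainer.fit(PlainModel(), train_data)
# Expect per-device keys such as 'gpu_id: 0/memory.used (MB)' to be present.
print([k for k in trainer.logged_metrics if 'memory.used' in k])
```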