Can we please add Timestamps to Each Epoch Checkpoint DEBUG line? #665

Closed
coffeecodeconverter opened this issue Nov 30, 2024 · 1 comment

@coffeecodeconverter

Can we please add a timestamp to the DEBUG line for each epoch checkpoint?
So instead of just having this:

DEBUG:fsspec.local:open file: /pipertts/train-me/lightning_logs/version_5/checkpoints/epoch=804-step=994862.ckpt
DEBUG:fsspec.local:open file: /pipertts/train-me/lightning_logs/version_5/checkpoints/epoch=809-step=994902.ckpt
DEBUG:fsspec.local:open file: /pipertts/train-me/lightning_logs/version_5/checkpoints/epoch=814-step=994942.ckpt

Can we please have it read as:

DEBUG:fsspec.local: YYYY-MM-DD HH:mm:ss.uuu - open file: /pipertts/train-me/lightning_logs/version_5/checkpoints/epoch=804-step=994862.ckpt
DEBUG:fsspec.local: YYYY-MM-DD HH:mm:ss.uuu - open file: /pipertts/train-me/lightning_logs/version_5/checkpoints/epoch=809-step=994902.ckpt
DEBUG:fsspec.local: YYYY-MM-DD HH:mm:ss.uuu - open file: /pipertts/train-me/lightning_logs/version_5/checkpoints/epoch=814-step=994942.ckpt
@coffeecodeconverter
Author

coffeecodeconverter commented Nov 30, 2024

I've done some more digging, and it looks like this is part of the PyTorch Lightning packages, so it's not really Piper's issue.
Nonetheless, I've managed to achieve the above, plus some other minor tweaks that make the output more readable.

I've tweaked 3 files:
/PiperTTS/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py
/PiperTTS/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py
/PiperTTS/.venv/lib/python3.10/site-packages/fsspec/implementations/local.py
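
As an aside: these DEBUG lines come from Python's standard logging module, so a less invasive option (just a sketch, I haven't checked it against Piper's own logging setup) would be to reconfigure the root logger from the training entry point instead of editing site-packages. Something like this would also give the millisecond precision asked for above:

import logging

# Sketch only: timestamp every log record (including the fsspec.local DEBUG
# lines) by changing the root logger's format. Assumes nothing in the training
# script depends on its own handlers; force=True (Python 3.8+) replaces any
# handlers already attached to the root logger.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(levelname)s:%(name)s: %(asctime)s.%(msecs)03d - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    force=True,
)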

.
.
.
.
Firstly, in checkpoint_connector.py
(/PiperTTS/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py),
to give it a timestamp and a line break between the message and the file path,
I've amended this
(right near the top)

from pytorch_lightning.utilities.migration.utils import _pl_migrate_checkpoint
from pytorch_lightning.utilities.rank_zero import rank_zero_deprecation, rank_zero_info, rank_zero_warn

with this
(to import datetime)

from pytorch_lightning.utilities.migration.utils import _pl_migrate_checkpoint
from pytorch_lightning.utilities.rank_zero import rank_zero_deprecation, rank_zero_info, rank_zero_warn

from datetime import datetime

Then I replaced this output message:

rank_zero_info(f"Restored all states from the checkpoint file at {self.resume_checkpoint_path}")

with this:
(gives it a timestamp, and a clear start point for the training)

timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
rank_zero_info(f"\n\n[{timestamp}]\nRestored all states from the checkpoint file at \n{self.resume_checkpoint_path}\n------------------------------------------\nStarting Training...\n------------------------------------------")
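
The same timestamp-then-log pattern is repeated in the next two files. If you'd rather keep the format string in one place, a tiny helper (hypothetical, not part of the edits shown here) would do it:

from datetime import datetime

# Hypothetical helper, not in the actual patch: a single place to change the
# timestamp format used across the patched files.
def now_stamp() -> str:
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")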

.
.
.
.
Secondly, in fit_loop.py
(/PiperTTS/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py),
to give it a timestamp,
I've amended this
(right near the top)

from pytorch_lightning.utilities.rank_zero import rank_zero_debug, rank_zero_info, rank_zero_warn
from pytorch_lightning.utilities.signature_utils import is_param_in_hook_signature

with this
(to import datetime)

from pytorch_lightning.utilities.rank_zero import rank_zero_debug, rank_zero_info, rank_zero_warn
from pytorch_lightning.utilities.signature_utils import is_param_in_hook_signature

from datetime import datetime

Then I replaced this output message:

rank_zero_info(f"`Trainer.fit` stopped: `max_epochs={self.max_epochs!r}` reached.")

with this:
(gives it a timestamp, and a clear finish point for the training)

timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
rank_zero_info(f"\n---------------------------------------------------------------------\n[{timestamp}] - `Trainer.fit` stopped: `max_epochs={self.max_epochs!r}` reached.\n---------------------------------------------------------------------")

.
.
.
.
Thirdly, in local.py
(/PiperTTS/.venv/lib/python3.10/site-packages/fsspec/implementations/local.py),
I've amended this
(added the datetime import just above the class, and added a timestamp to the logger)

class LocalFileOpener(io.IOBase):
    def __init__(
        self, path, mode, autocommit=True, fs=None, compression=None, **kwargs
    ):
        logger.debug("open file: %s", path)
        self.path = path
        self.mode = mode
        self.fs = fs
        self.f = None
        self.autocommit = autocommit
        self.compression = get_compression(path, compression)
        self.blocksize = io.DEFAULT_BUFFER_SIZE
        self._open()

with this

from datetime import datetime

class LocalFileOpener(io.IOBase):
    def __init__(
        self, path, mode, autocommit=True, fs=None, compression=None, **kwargs
    ):
        timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        logger.debug(f" [{timestamp}] open: %s", path")
        self.path = path
        self.mode = mode
        self.fs = fs
        self.f = None
        self.autocommit = autocommit
        self.compression = get_compression(path, compression)
        self.blocksize = io.DEFAULT_BUFFER_SIZE
        self._open()
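
If you'd rather not patch fsspec at all, the same effect can be scoped to just these lines (again only a sketch, untested here) by giving the fsspec.local logger its own timestamped handler from the training script:

import logging

# Sketch: attach a timestamped handler only to "fsspec.local" and stop
# propagation so the untimestamped root-logger copy of each record isn't
# printed as well.
fs_logger = logging.getLogger("fsspec.local")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(levelname)s:%(name)s: [%(asctime)s] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
))
fs_logger.addHandler(handler)
fs_logger.setLevel(logging.DEBUG)
fs_logger.propagate = False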

.
.
.
.
So, instead of this kind of output:

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
DEBUG:fsspec.local:open file: /home/train-me/lightning_logs/version_4/hparams.yaml
Restored all states from the checkpoint file at ~/train-me/lightning_logs/version_3/checkpoints/epoch=729-step=994262.ckpt
/home/PiperTTS/.venv/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:109: UserWarning: Total length of `DataLoader` across ranks is zero. Please make sure this was your intention.
  rank_zero_warn(
DEBUG:fsspec.local:open file: /home/train-me/lightning_logs/version_4/checkpoints/epoch=734-step=994302.ckpt
DEBUG:fsspec.local:open file: /home/train-me/lightning_logs/version_4/checkpoints/epoch=739-step=994342.ckpt
`Trainer.fit` stopped: `max_epochs=800` reached.

it now produces this output in the console:

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
DEBUG:fsspec.local: [2024-11-30 15:08:50] open: /home/train-me/lightning_logs/version_16/hparams.yaml

[2024-11-30 15:08:50]
Restored all states from the checkpoint file at
~/train-me/lightning_logs/version_15/checkpoints/epoch=939-step=995942.ckpt
------------------------
Starting Training...
------------------------
/home/PiperTTS/.venv/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:109: UserWarning: Total length of `DataLoader` across ranks is zero. Please make sure this was your intention.
  rank_zero_warn(
DEBUG:fsspec.local: [2024-11-30 15:12:31] open: /home/train-me/lightning_logs/version_16/checkpoints/epoch=944-step=995982.ckpt
DEBUG:fsspec.local: [2024-11-30 15:14:54] open: /home/train-me/lightning_logs/version_16/checkpoints/epoch=949-step=996022.ckpt

------------------------------------------------------------------------------
[2024-11-30 15:14:54] - `Trainer.fit` stopped: `max_epochs=950` reached.
------------------------------------------------------------------------------

coffeecodeconverter closed this as not planned on Nov 30, 2024