Misconfig? #118

Closed
ArEnSc opened this issue Jun 28, 2023 · 5 comments

Comments

ArEnSc commented Jun 28, 2023

I am seeing this error from Lightning:
raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: The provided lr scheduler ExponentialLR doesn't follow PyTorch's LRScheduler API. You should override the LightningModule.lr_scheduler_step hook with your own logic if you are using a custom LR scheduler.
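For reference, the override the message is asking for would look something like this. A minimal sketch, assuming Lightning 2.x (where the hook takes scheduler and metric; 1.x versions also take an optimizer_idx argument); it just mirrors the default stepping behaviour, and the class name is a hypothetical stand-in for whatever LightningModule piper_train uses:

import pytorch_lightning as pl

class MyVitsModule(pl.LightningModule):  # hypothetical name, for illustration only
    def lr_scheduler_step(self, scheduler, metric):
        # Replicate Lightning's default behaviour: plain schedulers such as
        # ExponentialLR step without a metric, while ReduceLROnPlateau-style
        # schedulers receive the monitored metric.
        if metric is None:
            scheduler.step()
        else:
            scheduler.step(metric)

In practice this error often comes from a torch / pytorch-lightning version mismatch (the scheduler base class Lightning checks for changed names between torch releases), so aligning the two versions is usually the simpler fix.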

ArEnSc commented Jun 28, 2023

Updated to the 2.0.x release of Lightning:
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/tensor/code/piper/src/python/piper_train/main.py", line 147, in
main()
File "/home/tensor/code/piper/src/python/piper_train/main.py", line 37, in main
Trainer.add_argparse_args(parser)
AttributeError: type object 'Trainer' has no attribute 'add_argparse_args'
They apparently removed this in 2.0.
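Trainer.add_argparse_args (and Trainer.from_argparse_args) were indeed removed in Lightning 2.0, so a script that relies on them has to declare the Trainer flags itself. A rough sketch of that replacement, assuming only a handful of Trainer options are needed (the flag names below are illustrative, not piper_train's actual argument set):

import argparse
from pytorch_lightning import Trainer

def main():
    parser = argparse.ArgumentParser()
    # Trainer.add_argparse_args() no longer exists in 2.x: add the options by hand.
    parser.add_argument("--max_epochs", type=int, default=1000)
    parser.add_argument("--devices", type=int, default=1)
    parser.add_argument("--accelerator", default="gpu")
    parser.add_argument("--precision", default="32-true")
    args = parser.parse_args()

    # Trainer.from_argparse_args() is also gone: pass the values explicitly.
    trainer = Trainer(
        max_epochs=args.max_epochs,
        devices=args.devices,
        accelerator=args.accelerator,
        precision=args.precision,
    )
    # ... trainer.fit(model, ...) as before; note that in 2.x a resume
    # checkpoint goes to trainer.fit(ckpt_path=...) instead of the Trainer.

The alternative, as below, is simply to stay on a 1.x release of pytorch-lightning that the script was written against.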

ArEnSc commented Jun 28, 2023

Tried a lower version of PyTorch Lightning:

DEBUG:piper_train:Checkpoints will be saved every 1000 epoch(s)
DEBUG:vits.dataset:Loading dataset: /home/tensor/code/piper/src/python/TrainingData/dataset.jsonl
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
DEBUG:fsspec.local:open file: /home/tensor/code/piper/src/python/TrainingData/lightning_logs/version_3/hparams.yaml
/home/tensor/code/piper/src/.venv/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:111: UserWarning: Total length of DataLoader across ranks is zero. Please make sure this was your intention.
rank_zero_warn(
/home/tensor/code/piper/src/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 24 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
<--- gets stuck here with version 1.8.5

ArEnSc commented Jun 28, 2023

Actually, it looks like it's running. I jump to TensorBoard from here and figure out how training is doing.

@beqabeqa473

1.7.7 is the supported version.

coffeecodeconverter commented Nov 30, 2024

Some things to mention:
At the point you're highlighting where it's getting stuck, it's not stuck; it's doing the training at that point.

Below is a snippet from my testbed.
Notice that after the warning about increasing num_workers for the train_dataloader, I have an additional warning (compared to you) about the number of training batches being smaller than log_every_n_steps, but after that it begins my training, denoted by the 'checkpoints/epoch=734-step=994302.ckpt' and 'checkpoints/epoch=739-step=994342.ckpt' lines.
(I have checkpoint_epochs=5 set so I see updates more frequently.)

  rank_zero_warn(
/home/PiperTTS/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: 
PossibleUserWarning: 
The dataloader, train_dataloader, does not have many workers which may be a bottleneck. 
Consider increasing the value of the `num_workers` argument (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.

   rank_zero_warn(
/home/PiperTTS/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:1609: 
PossibleUserWarning: 
The number of training batches (4) is smaller than the logging interval Trainer(log_every_n_steps=50). 
Set a lower value for log_every_n_steps if you want to see logs for the training epoch.

  rank_zero_warn(
DEBUG:fsspec.local:open file: /home/train-me/lightning_logs/version_4/checkpoints/epoch=734-step=994302.ckpt
DEBUG:fsspec.local:open file: /home/train-me/lightning_logs/version_4/checkpoints/epoch=739-step=994342.ckpt

For me, with my dataset, settings, and hardware, 1 epoch completes in ~28 seconds,
so I see an update in the console every ~2 mins 30 secs, once 5 epochs complete.

Your checkpoints occur every 1000 epochs. If it's taking 1 min per epoch (depending on your dataset, batch size, precision level, test samples taken, your hardware capability, etc.), you won't see any updates in the console/terminal window for 1000 minutes (16.6 hours!), and even if you're completing 1 epoch every second, that's still going to take 16.6 minutes before you see any updates.
So you need to break it down into a smaller test; there's no point potentially waiting up to 16 hours for an update just to determine whether you have an issue.
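As a quick sanity check on those numbers (hypothetical per-epoch times, just the arithmetic from above):

# Back-of-the-envelope wait until the next checkpoint line appears,
# with checkpoints written every 1000 epochs.
checkpoint_epochs = 1000

for seconds_per_epoch in (60, 1):  # 1 min/epoch vs 1 s/epoch
    wait_seconds = checkpoint_epochs * seconds_per_epoch
    print(f"{seconds_per_epoch:>2} s/epoch -> {wait_seconds / 3600:.1f} h "
          f"({wait_seconds / 60:.1f} min) until the next console update")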

I'd suggest creating a very small test dataset: literally 2 or 3 lines in your dataset.jsonl. Maybe record some fresh, really short WAVs, around 5 words each, so it won't take long to process.
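One way to carve that out without recording anything new is to keep just the first few lines of the existing dataset.jsonl. A rough sketch (the paths are placeholders, and it assumes any files those lines reference are still reachable from the new dataset dir):

from pathlib import Path

# Placeholders: point these at your real dataset dir and a new test dir.
src = Path.home() / "train-me" / "dataset.jsonl"
dst = Path.home() / "train-me-small" / "dataset.jsonl"
dst.parent.mkdir(parents=True, exist_ok=True)

# Copy only the first 3 utterances to make a tiny smoke-test dataset.
with src.open() as fin, dst.open("w") as fout:
    for i, line in enumerate(fin):
        if i >= 3:
            break
        fout.write(line)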

Then change your command to the one below.
(Obviously, don't forget to change the --dataset-dir and --resume_from_checkpoint arguments to point to your dataset and checkpoint. Also amend --max_epochs to maybe 5 higher than your current checkpoint, just for this test; for example, the checkpoint below is already on epoch 100 and I've set max_epochs to 105.)

python3 -m piper_train \
--dataset-dir ~/train-me \
--accelerator 'gpu' \
--gpus 1 \
--batch-size 32 \
--validation-split 0.0 \
--num-test-examples 0 \
--max_epochs 105 \
--resume_from_checkpoint "~/train-me/lightning_logs/**VersionFolderWithYourCheckpoint**/checkpoints/epoch=100-step=1000.ckpt" \
--checkpoint-epochs 1 \
--precision 32 \
--max-phoneme-ids 400 \
--quality medium

Then assume it's going to take (at the extreme) 2 mins per epoch.
You should then expect to see SOMETHING within 2-6 mins:
either an error, or it progressing.
If you continue to see NOTHING, that would confirm you have a problem elsewhere.

But I just think you had too much data, and settings such that you wouldn't see an update for a VERY long time, leading you to believe something was up or it had crashed.

If the test works, you know it's nothing more than the combination of your dataset, settings, and hardware capability.
You can slowly work back from the test command and setup to your original command and setup.
For instance, add your original dataset back in, but keep --checkpoint-epochs at a lower figure (1-5 maybe?); then you'd see an update far sooner and be reassured it's ticking along.

After that, you can go back to setting --checkpoint-epochs to 1000 again if you wish.
