Misconfig? #118

Closed
ArEnSc opened this issue Jun 28, 2023 · 5 comments

Comments

ArEnSc commented Jun 28, 2023

I am seeing this error from Lightning:
raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: The provided lr scheduler ExponentialLR doesn't follow PyTorch's LRScheduler API. You should override the LightningModule.lr_scheduler_step hook with your own logic if you are using a custom LR scheduler.
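For reference, the override the message is asking for would look something like this. A minimal sketch, assuming Lightning 2.x (where the hook takes scheduler and metric; 1.x versions also take an optimizer_idx argument); it just mirrors the default stepping behaviour, and the class name is a hypothetical stand-in for whatever LightningModule piper_train uses:

import pytorch_lightning as pl

class MyVitsModule(pl.LightningModule):  # hypothetical name, for illustration only
    def lr_scheduler_step(self, scheduler, metric):
        # Replicate Lightning's default behaviour: plain schedulers such as
        # ExponentialLR step without a metric, while ReduceLROnPlateau-style
        # schedulers receive the monitored metric.
        if metric is None:
            scheduler.step()
        else:
            scheduler.step(metric)

In practice this error often comes from a torch / pytorch-lightning version mismatch (the scheduler base class Lightning checks for changed names between torch releases), so aligning the two versions is usually the simpler fix.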

ArEnSc commented Jun 28, 2023

Updated to the 2.0.x release of Lightning:
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/tensor/code/piper/src/python/piper_train/main.py", line 147, in
main()
File "/home/tensor/code/piper/src/python/piper_train/main.py", line 37, in main
Trainer.add_argparse_args(parser)
AttributeError: type object 'Trainer' has no attribute 'add_argparse_args'
They apparently removed this in 2.0.
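Trainer.add_argparse_args (and Trainer.from_argparse_args) were indeed removed in Lightning 2.0, so a script that relies on them has to declare the Trainer flags itself. A rough sketch of that replacement, assuming only a handful of Trainer options are needed (the flag names below are illustrative, not piper_train's actual argument set):

import argparse
from pytorch_lightning import Trainer

def main():
    parser = argparse.ArgumentParser()
    # Trainer.add_argparse_args() no longer exists in 2.x: add the options by hand.
    parser.add_argument("--max_epochs", type=int, default=1000)
    parser.add_argument("--devices", type=int, default=1)
    parser.add_argument("--accelerator", default="gpu")
    parser.add_argument("--precision", default="32-true")
    args = parser.parse_args()

    # Trainer.from_argparse_args() is also gone: pass the values explicitly.
    trainer = Trainer(
        max_epochs=args.max_epochs,
        devices=args.devices,
        accelerator=args.accelerator,
        precision=args.precision,
    )
    # ... trainer.fit(model, ...) as before; note that in 2.x a resume
    # checkpoint goes to trainer.fit(ckpt_path=...) instead of the Trainer.

The alternative, as below, is simply to stay on a 1.x release of pytorch-lightning that the script was written against.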

ArEnSc commented Jun 28, 2023

Tried a lower version of PyTorch Lightning:

DEBUG:piper_train:Checkpoints will be saved every 1000 epoch(s)
DEBUG:vits.dataset:Loading dataset: /home/tensor/code/piper/src/python/TrainingData/dataset.jsonl
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
DEBUG:fsspec.local:open file: /home/tensor/code/piper/src/python/TrainingData/lightning_logs/version_3/hparams.yaml
/home/tensor/code/piper/src/.venv/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:111: UserWarning: Total length of DataLoader across ranks is zero. Please make sure this was your intention.
rank_zero_warn(
/home/tensor/code/piper/src/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 24 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
<--- gets stuck here with version 1.8.5

ArEnSc commented Jun 28, 2023

Actually, it looks like it's running. I jump to TensorBoard from here and figure out how training is doing.

@beqabeqa473

1.7.7 is the supported version.

coffeecodeconverter commented Nov 30, 2024

Some things to mention:
At the point you're highlighting where it's getting stuck, it's not stuck; it's doing the training at that point.

Below is a snippet from my testbed.
Notice that after the warning about increasing num_workers for the train_dataloader, I have an additional warning (compared to you) about the number of training batches being smaller than log_every_n_steps, but after that it begins my training, denoted by the 'checkpoints/epoch=734-step=994302.ckpt' and 'checkpoints/epoch=739-step=994342.ckpt' lines.
(I have checkpoint_epochs=5 set so I see updates more frequently.)

  rank_zero_warn(
/home/PiperTTS/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: 
PossibleUserWarning: 
The dataloader, train_dataloader, does not have many workers which may be a bottleneck. 
Consider increasing the value of the `num_workers` argument (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.

   rank_zero_warn(
/home/PiperTTS/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:1609: 
PossibleUserWarning: 
The number of training batches (4) is smaller than the logging interval Trainer(log_every_n_steps=50). 
Set a lower value for log_every_n_steps if you want to see logs for the training epoch.

  rank_zero_warn(
DEBUG:fsspec.local:open file: /home/train-me/lightning_logs/version_4/checkpoints/epoch=734-step=994302.ckpt
DEBUG:fsspec.local:open file: /home/train-me/lightning_logs/version_4/checkpoints/epoch=739-step=994342.ckpt

For me, with my dataset, settings, and hardware, 1 epoch completes in ~28 seconds,
so I see an update in the console every ~2 mins 30 secs, once 5 epochs complete.

Your checkpoints occur every 1000 epochs. If it's taking 1 min per epoch (depending on your dataset, batch size, precision level, test samples taken, your hardware capability, etc.), you won't see any updates in the console/terminal window for 1000 minutes (16.6 hours!), and even if you're completing 1 epoch every second, that's still going to take 16.6 minutes before you see any updates.
So you need to break it down into a smaller test; there's no point potentially waiting up to 16 hours for an update just to determine whether you have an issue.
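As a quick sanity check on those numbers (hypothetical per-epoch times, just the arithmetic from above):

# Back-of-the-envelope wait until the next checkpoint line appears,
# with checkpoints written every 1000 epochs.
checkpoint_epochs = 1000

for seconds_per_epoch in (60, 1):  # 1 min/epoch vs 1 s/epoch
    wait_seconds = checkpoint_epochs * seconds_per_epoch
    print(f"{seconds_per_epoch:>2} s/epoch -> {wait_seconds / 3600:.1f} h "
          f"({wait_seconds / 60:.1f} min) until the next console update")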

I'd suggest creating a very small test dataset: literally 2 or 3 lines in your dataset.jsonl. Maybe record some fresh, really short WAVs, around 5 words each, so it won't take long to process.
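One way to carve that out without recording anything new is to keep just the first few lines of the existing dataset.jsonl. A rough sketch (the paths are placeholders, and it assumes any files those lines reference are still reachable from the new dataset dir):

from pathlib import Path

# Placeholders: point these at your real dataset dir and a new test dir.
src = Path.home() / "train-me" / "dataset.jsonl"
dst = Path.home() / "train-me-small" / "dataset.jsonl"
dst.parent.mkdir(parents=True, exist_ok=True)

# Copy only the first 3 utterances to make a tiny smoke-test dataset.
with src.open() as fin, dst.open("w") as fout:
    for i, line in enumerate(fin):
        if i >= 3:
            break
        fout.write(line)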

Then change your command to the one below.
(Obviously, don't forget to change the --dataset-dir and --resume_from_checkpoint arguments to point to your dataset and checkpoint. Also amend --max_epochs to maybe 5 higher than your current checkpoint, just for this test; for example, the checkpoint below is already on epoch 100 and I've set max_epochs to 105.)

python3 -m piper_train \
--dataset-dir ~/train-me \
--accelerator 'gpu' \
--gpus 1 \
--batch-size 32 \
--validation-split 0.0 \
--num-test-examples 0 \
--max_epochs 105 \
--resume_from_checkpoint "~/train-me/lightning_logs/**VersionFolderWithYourCheckpoint**/checkpoints/epoch=100-step=1000.ckpt" \
--checkpoint-epochs 1 \
--precision 32 \
--max-phoneme-ids 400 \
--quality medium

Then assume it's going to take (at the extreme) 2 mins per epoch.
You should then expect to see SOMETHING within 2-6 mins:
either an error, or it progressing.
If you continue to see NOTHING, that would confirm you have a problem elsewhere.

But I just think you had too much data, and settings such that you wouldn't see an update for a VERY long time, leading you to believe something was up or it had crashed.

If the test works, you know it's nothing more than the combination of your dataset, settings, and hardware capability.
You can slowly work back from the test command and setup to your original command and setup.
For instance, add your original dataset back in, but keep --checkpoint-epochs at a lower figure (1-5 maybe?); then you'd see an update far sooner and be reassured it's ticking along.

After that, you can go back to setting --checkpoint-epochs to 1000 again if you wish.
