Hello there. I'm in the process of setting up a Colab notebook to train a couple of models needed for Soft-VC inference, mostly for personal ease of access. One of the first steps is obtaining a custom-trained/finetuned HuBERT model.
I've gone through some trial and error making sure the directories are set up and prepared properly for training/finetuning on Colab, and I believe I have something that works. There is one issue, however...
When I begin training the model (running on Colab with a Tesla T4), it initializes properly at first, but soon after I'm thrown a RuntimeError. Here's the full log:
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for 1 nodes.
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
INFO:__mp_main__:********************************************************************************
INFO:__mp_main__:PyTorch version: 1.9.1+cu102
INFO:__mp_main__:CUDA version: 10.2
INFO:__mp_main__:CUDNN version: 7605
INFO:__mp_main__:CUDNN enabled: True
INFO:__mp_main__:CUDNN deterministic: False
INFO:__mp_main__:CUDNN benchmark: False
INFO:__mp_main__:# of GPUS: 1
INFO:__mp_main__:batch size: 64
INFO:__mp_main__:iterations per epoch: 1
INFO:__mp_main__:# of epochs: 25001
INFO:__mp_main__:started at epoch: 1
INFO:__mp_main__:********************************************************************************
/content/hubert/train.py:232: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
nn.utils.clip_grad_norm_(hubert.parameters(), MAX_NORM)
INFO:__mp_main__:
train -- epoch: 1, masked loss: 5.6476, unmasked loss: 6.2735,
masked accuracy: 1.80, umasked accuracy: 1.59
INFO:root:Reducer buckets have been rebuilt in this iteration.
INFO:__mp_main__:
train -- epoch: 2, masked loss: 5.7180, unmasked loss: 6.3430,
masked accuracy: 2.09, umasked accuracy: 2.44
INFO:__mp_main__:
train -- epoch: 3, masked loss: 5.7401, unmasked loss: 6.3748,
masked accuracy: 1.54, umasked accuracy: 2.33
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/content/hubert/train.py", line 202, in train
for wavs, codes in train_loader:
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/content/hubert/hubert/dataset.py", line 90, in collate
codes = torch.stack(collated_codes, dim=0)
RuntimeError: stack expects each tensor to be equal size, but got [0] at entry 0 and [170] at entry 5
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 66, in _wrap
sys.exit(1)
SystemExit: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/process.py", line 300, in _bootstrap
util._exit_function()
File "/usr/lib/python3.7/multiprocessing/util.py", line 357, in _exit_function
p.join()
File "/usr/lib/python3.7/multiprocessing/process.py", line 140, in join
res = self._popen.wait(timeout)
File "/usr/lib/python3.7/multiprocessing/popen_fork.py", line 48, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/usr/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 8288) is killed by signal: Terminated.
Traceback (most recent call last):
File "train.py", line 452, in <module>
join=True,
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/content/hubert/train.py", line 202, in train
for wavs, codes in train_loader:
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/content/hubert/hubert/dataset.py", line 90, in collate
codes = torch.stack(collated_codes, dim=0)
RuntimeError: stack expects each tensor to be equal size, but got [0] at entry 0 and [170] at entry 5
As you can see, it actually trains for up to 3 epochs, then stops suddenly due to the error. On some runs I've seen it reach at least 5 or 6 epochs before throwing the same error; in other cases it throws the error immediately, before even finishing 1 epoch.
From what I was able to find, I believe adding some kind of padding to the collated wavs and codes in hubert/dataset.py could potentially work, but I'm not sure exactly how to implement it. Mainly I'm just interested in training/finetuning a personal model to test custom Soft-VC inference.
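To give a clearer idea of the padding approach I mean, here's a rough, untested sketch of a collate function (my own guess, not the repo's actual dataset.py code; all names are placeholders). It pads each wav and code sequence to the longest item in the batch with torch.nn.utils.rnn.pad_sequence instead of stacking them directly. I assume the padded positions would still need to be masked out of the loss, and the [0]-sized tensor in the error message makes me wonder whether one of the code sequences is simply coming back empty, which padding alone wouldn't explain.

```python
# Rough sketch of a padding collate_fn (untested, placeholder names only --
# not the actual hubert/dataset.py implementation).
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_collate(batch):
    """Pad variable-length (wav, codes) pairs to the longest item in the batch."""
    wavs, codes = zip(*batch)
    # Flatten to 1-D so pad_sequence pads along the time axis.
    wavs = [w.view(-1) for w in wavs]
    codes = [c.view(-1) for c in codes]
    # pad_sequence pads every sequence to the length of the longest one.
    wavs = pad_sequence(wavs, batch_first=True, padding_value=0.0)
    codes = pad_sequence(codes, batch_first=True, padding_value=0)
    return wavs, codes


# Hypothetical usage: passed to the DataLoader via collate_fn=pad_collate.
```

Again, this is just the direction I was thinking, not something I've verified against the training code.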
This could also be the fault of something else. I've definitely followed the training prep you provided (much appreciated, by the way), but I should mention that the dataset I'm using is 22 kHz. If that's an issue, please let me know.
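In case 16 kHz input turns out to be required, my plan would be to just resample everything ahead of time, roughly like this (assumes torchaudio is available in the Colab runtime; the folder paths are placeholders for my own dataset layout):

```python
# Sketch of a one-off resampling pass (paths are placeholders for my dataset).
from pathlib import Path

import torchaudio

TARGET_SR = 16000
in_dir = Path("dataset/wavs_22k")   # hypothetical 22 kHz source folder
out_dir = Path("dataset/wavs_16k")  # hypothetical 16 kHz output folder
out_dir.mkdir(parents=True, exist_ok=True)

for wav_path in sorted(in_dir.glob("*.wav")):
    wav, sr = torchaudio.load(str(wav_path))
    if sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)
    torchaudio.save(str(out_dir / wav_path.name), wav, TARGET_SR)
```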
Great potential for this! Just need to put the puzzle together, so to speak.