Custom Dataset support + Gentle-based custom dataset preprocessing support #78
Conversation
…f arguments)

  File "synthesis.py", line 137, in <module>
    model, text, p=replace_pronunciation_prob, speaker_id=speaker_id, fast=True)
  File "synthesis.py", line 66, in tts
    sequence, text_positions=text_positions, speaker_ids=speaker_ids)
  File "H:\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "H:\Tensorflow_Study\git\deepvoice3_pytorch\deepvoice3_pytorch\__init__.py", line 79, in forward
    text_positions, frame_positions, input_lengths)
  File "H:\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "H:\Tensorflow_Study\git\deepvoice3_pytorch\deepvoice3_pytorch\__init__.py", line 116, in forward
    text_sequences, lengths=input_lengths, speaker_embed=speaker_embed)
  File "H:\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "H:\Tensorflow_Study\git\deepvoice3_pytorch\deepvoice3_pytorch\deepvoice3.py", line 75, in forward
    x = self.embed_tokens(text_sequences)  <- change this to long!
  File "H:\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "H:\envs\pytorch\lib\site-packages\torch\nn\modules\sparse.py", line 103, in forward
    self.scale_grad_by_freq, self.sparse
  File "H:\envs\pytorch\lib\site-packages\torch\nn\_functions\thnn\sparse.py", line 59, in forward
    output = torch.index_select(weight, 0, indices.view(-1))
TypeError: torch.index_select received an invalid combination of arguments - got (torch.cuda.FloatTensor, int, torch.cuda.IntTensor), but expected (torch.cuda.FloatTensor source, int dim, torch.cuda.LongTensor index)

Changed text_sequence to long, as required by torch.index_select.
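The fix boils down to casting the token-id tensor to int64 before it reaches nn.Embedding, which indexes its weight matrix via torch.index_select internally. A minimal sketch of the same failure mode and fix, assuming token ids arrive as 32-bit integers (the tensor contents here are illustrative):

```python
import torch
import torch.nn as nn

# nn.Embedding looks up rows with torch.index_select, which requires
# LongTensor (int64) indices. Token ids loaded as int32 trigger the
# TypeError above, so cast them with .long() first.
embed = nn.Embedding(num_embeddings=256, embedding_dim=8)
text_sequences = torch.tensor([[5, 12, 42]], dtype=torch.int32)  # hypothetical ids
x = embed(text_sequences.long())  # .long() converts int32 -> int64
```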
This reverts commit 5214c24.
In windows, this causes WinError 123
Windows Specific Filename bugfix (r9y9#58) reverse PR
Supports JSON format for dataset creation. Ensures compatibility with http://github.com/carpedm20/multi-Speaker-tacotron-tensorflow
PR for Version Up in upstream repos
Reverse PR
gitignore change
Overall looks great! I'm happy to see your contributions. Once a few comments are addressed I'd like to merge this. Let me know if you mess up squashing commits; I can hit "Squash and merge" if you want.
.gitignore (Outdated)

    presets/deepvoice3_got.json
    presets/deepvoice3_gotOnly.json
    presets/deepvoice3_stest.json
    presets/deepvoice3_test.json
Can these be safely removed? I'm assuming they are for your local setup only.
Yes I will remove it! :) Thanks for telling me
gentle_web_align.py (Outdated)

    server_addr = arguments['--server_addr']
    port = int(arguments['--port'])
    max_unalign = float(arguments['--max_unalign'])
    if arguments['--nested-directories'] == None:
nit: I'd slightly prefer `is None` to `== None`.
Great! I will change this too.
gentle_web_align.py (Outdated)

    Created on Sat Apr 21 09:06:37 2018
    Phoneme alignment and conversion in HTK-style label file using Web-served Gentle
    This works on any type of english dataset.
    This allows its usage on Windows (Via Docker) and external server.
Just to be sure, the reason for using server-based Gentle rather than the Python API is that it allows use on Windows, right? Any other reasons?
Yep, and also because Gentle is Python 2 compatible only, while this repo is Python 3 compatible.
In addition, if we use server-based Gentle, we can also use an external server.
    if os.path.splitext(wav_name)[0] != os.path.splitext(txt_name)[0]:
        print(' [!] wav name and transcript name does not match - exiting...')
        return response
    with open(txt_path, 'r', encoding='utf-8-sig') as txt_file:
I'm guessing `encoding='utf-8-sig'` is (almost) Windows-specific..? Did you see a UnicodeError with `encoding='utf-8'`?
Well, it was in my case (probably because I am currently mixing Windows (for running PyTorch) and Linux (for data preparation/alignment)), and I think that setting encoding='utf-8-sig' when opening files is better for ensuring compatibility.
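The difference is easy to reproduce: utf-8-sig transparently strips the byte-order mark that Windows editors often prepend, while plain utf-8 leaves it in the decoded text. A small self-contained sketch (the file name is illustrative):

```python
import codecs
import os
import tempfile

# Write a transcript file the way many Windows editors do: UTF-8 with a BOM.
path = os.path.join(tempfile.mkdtemp(), "transcript.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "hello world".encode("utf-8"))

with open(path, "r", encoding="utf-8") as f:
    plain = f.read()      # BOM survives as a '\ufeff' prefix
with open(path, "r", encoding="utf-8-sig") as f:
    stripped = f.read()   # BOM is removed by the -sig codec
```

On BOM-less files the two codecs behave identically, which is why utf-8-sig is the safer default for mixed Windows/Linux pipelines.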
hparams.py (Outdated)

    @@ -125,6 +125,14 @@
     # Forced garbage collection probability
     # Use only when MemoryError continues in Windows (Disabled by default)
     #gc_probability = 0.001,

     # json_meta mode only
     # 0: "use all",
Please consider spaces rather than tabs.
Oops O.o will change it.
Another vestigial element
nikl_m.py (Outdated)

    @@ -4,6 +4,7 @@
     import os
     import audio
     import re
     from hparams import hparams
nit: this is not necessary because #74 was merged.
will change it! thanks!
setup.py (Outdated)

    "numba",
    "lws <= 1.0",
    "nltk",
    "requests",
Please consider spaces
Another vestigial element :/
will change it
setup.py (Outdated)

    @@ -82,10 +82,11 @@ def create_readme_rst():
     "torch >= 0.3.0",
     "unidecode",
     "inflect",
    -"librosa",
    +"librosa == 0.5.1",
Could this be loosened? I'm using a development version of librosa.
For me, using the latest stable version (from pip) caused a synthesis error (librosa/librosa#640).
Did it get fixed? (The issue is reported against the dev version; does the fix apply to a stable release?)
I don't use `librosa.output.write_wav` anymore, so librosa/librosa#640 won't be a problem for me (and the repo). I can fix issues if you give me reproducible code. If you have code calling `librosa.output.write_wav` locally, try replacing it with `scipy.io.wavfile.write`.
Discovered it was due to my own modifications. Thanks :)
hparams.py (Outdated)

    # 1: "ignore only unmatched_alignment",
    # 2: "fully ignore recognition",
    ignore_recognition_level = 2,
    min_text=20,
I was also thinking about this, and about something like `min_frames` to remove short audio clips from the training data. Just out of curiosity, did you get improvements from this? I believe the parameter highly depends on the dataset, and I'd be happy if you could leave a comment, for example: "min_text=20 works well for dataset A but can be adjusted depending on the dataset".
Actually it was implemented for a few reasons.
- My automatic alignment tool (which I am going to release soon) cannot handle short speeches well.
- From my experience, short speeches in non-dedicated datasets (especially those extracted from movie clips) were prone to noise and a different cadence of speech. (e.g. the word "help" in "The help that is needed is not there." vs. "help" in "HELP!!!")
- (From my experience with other deep-learning-based TTS) Even if the dataset is nearly noise-free and has uniform cadence, short speeches tend to interfere with the result. (probably because my test set is usually at least 3 words long)

But it was implemented as a quick fix, and I do know that min_frames is a much better solution.
Will leave the comments :)
nikl_m.py (Outdated)

    spectrogram = audio.spectrogram(wav).astype(np.float32)
    except:
        print(wav_path)
        print(wav)
    n_frames = spectrogram.shape[1]
This is just for debugging?
O.O I thought I had removed it already.
Will remove it.
TL;DR: Given that it's harder to learn alignments within samples that have long pauses, it would be good to have a max_pause param as well. The set of params I think are needed includes phrase_length_min, phrase_length_max, and max_silence_length.
@rafaelvalle Great idea!
@engiecat It's important to have thresholds for phrase length so that we can optimize for batch size and GPU usage. Remember that the longest sample in a batch dominates the batch length, because all shorter samples are padded to match the length of the longest sample. Actually, I haven't trained DeepVoice3 before and would be interested to know how well it fares with datasets that have a lot of variety in silence or speech rate.
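The padding cost described above is easy to quantify: every sample in a batch is padded to the longest sample's length, so the wasted fraction grows with length variance. A small sketch:

```python
def padding_overhead(lengths):
    """Fraction of a padded batch that is padding: every sample is
    padded up to the longest sample's length."""
    longest = max(lengths)
    total = longest * len(lengths)  # slots actually allocated
    return (total - sum(lengths)) / total

# A batch of near-equal lengths wastes little; one long outlier wastes a lot,
# which is why length thresholds (or bucketing by length) help GPU usage.
```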
@rafaelvalle
r9y9#53 (comment) issue solved in PyTorch 0.4
@r9y9 +) I fixed the issue of #5 by changing the matplotlib backend from Tkinter (TkAgg) to PyQt5 (Qt5Agg). ++) Also, I discovered that the issue given in #53 (comment) seems to be solved after the PyTorch 0.4 upgrade. So, instead of changing hparams, I changed the code to give just a warning.
LGTM. Thanks!
…pport (r9y9#78)

* Fixed TypeError (torch.index_select received an invalid combination of arguments): changed text_sequence to long, as required by torch.index_select. (The full traceback is quoted earlier in this thread.)
* Fixed NoneType error in collect_features
* requirements.txt fix
* Memory leakage bugfix + hparams change
* Pre-PR modifications
* Pre-PR modifications 2
* Pre-PR modifications 3
* Post-PR modification
* Remove requirements.txt
* num_workers to 1 in train.py
* Windows log filename bugfix
* Revert "Windows log filename bugfix" (this reverts commit 5214c24)
* merge 2
* Windows filename bugfix (in Windows, this causes WinError 123)
* Cleanup before PR
* JSON format metadata support: supports JSON format for dataset creation; ensures compatibility with http://github.com/carpedm20/multi-Speaker-tacotron-tensorflow
* Web-based Gentle aligner support
* README change + gentle patch
* .gitignore change
* Flake8 fix
* Post PR commit - also fixed #5; r9y9#53 (comment) issue solved in PyTorch 0.4
* Post-PR 2 - .gitignore
Loss: 0.24586915151745664
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
Hello.
I have developed several new functionalities for the repo.
1. Custom dataset support.
Until now, dataset preprocessing had been limited to several well-known datasets (unless someone creates their own dataset and implements their own preprocessing code).
I created a custom dataset import option, which supports JSON (as described in carpedm20/multi-speaker-tacotron-tensorflow) and CSV metadata formats.
Some datasets, especially an auto-generated Korean dataset, are currently available in this format, and the format itself is quite straightforward.
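As a sketch of what such a loader might accept: the shape below (each audio path mapped to its transcript, possibly as a list of candidates) is my reading of the carpedm20-style JSON metadata and may differ in detail from the actual format:

```python
import json

# Assumed carpedm20-style metadata: a JSON object mapping each audio path
# to its transcript (a plain string, or a list of candidate transcripts).
meta_json = '{"audio/1.wav": "First line.", "audio/2.wav": ["Second line."]}'

def load_metadata(text):
    pairs = []
    for path, transcript in json.loads(text).items():
        if isinstance(transcript, list):  # take the first candidate
            transcript = transcript[0]
        pairs.append((path, transcript))
    return pairs
```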
Below are the results obtained with the custom dataset (the LJSpeech dataset plus a proprietary audiobook, automatically aligned).
LJSpeech
https://www.dropbox.com/s/du807tjcpyw3ddj/step000240000_text5_multispeaker0_predicted.wav?dl=0
Audiobook
https://www.dropbox.com/s/0xrl2z30d88z5e8/step000240000_text5_multispeaker1_predicted.wav?dl=0
2. Custom dataset HTK-style preprocessing
I have created code for phoneme alignment of a custom dataset using Gentle (assuming its server is running somewhere).
Although Festival- and Merlin-based alignment was better for the VCTK dataset, performance was severely undermined when a noisy dataset was introduced.
e.g. for this audio, where a pause from 2.6s to 4.5s can be observed, the Festival/Merlin-generated label file is as follows.
label file
This does not show the pause between 2.6s and 4.5s.
In contrast, with Gentle-based alignment, the generated HTK-style label file is as follows.
Gentle-generated label file
Here silence between 2.33s and 4.48s can be observed (though the actual silence begins at approx. 2.6s).
After applying Gentle-based phoneme alignment, performance improved moderately.
(Using the same custom dataset, with the same hparams)
Without phoneme alignment
https://www.dropbox.com/s/9gd6r2rh4ppfwy6/step000260000_text5_multispeaker3_predicted.wav?dl=0
With gentle-based phoneme alignment
https://www.dropbox.com/s/uankezza1vfpo88/Gentlestep000260000_text5_multispeaker3_predicted.wav?dl=0
I am also considering using the phoneme alignment results to trim long in-speech pauses (like the one shown above), by setting a maximum threshold on the interval between each phoneme (e.g. 0.5s).
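That thresholding could be sketched directly on the HTK-style labels: scan for silence segments longer than the cap. Times in HTK label files are in 100 ns units; the silence label names and the 0.5 s default below are my assumptions, not anything fixed by this PR:

```python
HTK_UNIT = 1e-7  # seconds per HTK time unit (100 ns)

def long_pauses(label_lines, max_pause=0.5, silence=("sil", "pau", "sp")):
    """Return (start_s, end_s, duration_s) for silence segments longer
    than max_pause, given HTK-style "<start> <end> <label>" lines."""
    pauses = []
    for line in label_lines:
        start, end, label = line.split()
        dur = (int(end) - int(start)) * HTK_UNIT
        if label in silence and dur > max_pause:
            pauses.append((int(start) * HTK_UNIT, int(end) * HTK_UNIT, dur))
    return pauses
```

The flagged segments could then be cut from the waveform (or used to split the clip) before training.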
PS. Sorry for the messy commits, I will try to squash them after the merge. :/