
[TTS] Global Style Tokens implementation in FastPitch doesn't follow the original paper #7420

Closed
anferico opened this issue Sep 12, 2023 · 3 comments
Labels: bug, TTS

anferico (Contributor) commented:

Describe the bug

The implementation of Global Style Tokens (GSTs) in FastPitch introduced in #6417 does not follow the prescription of the original paper. In particular, the difference lies in the choice of the reference audio for a given training/validation sample <text, ground_truth_audio>:

  • In the original paper, it is recommended to use the ground-truth audio as the reference audio
  • In the NeMo implementation, the reference audio is chosen at random among audio samples from the same speaker

I discovered this after training a FastPitch model with GSTs and observing that the choice of the reference audio at inference time would have virtually no impact at all. More precisely, given a text T and n different reference audios R_1, ..., R_n, passing <T, R_i> for any i in [1, n] would result in almost the exact same audio A. I say "almost" because the files appeared different when compared using diff, but they had the exact same length and they sounded exactly the same to the ear.
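A quick way to confirm the symptom described above is to compare the generated waveforms numerically. The sketch below is a hypothetical check (the `outputs_identical` helper and the simulated waveforms are mine, not part of NeMo): it flags a set of outputs as "effectively identical" when they share the same length and differ only by negligible numerical noise, which is exactly what I observed.

```python
import numpy as np

def outputs_identical(audios, tol=1e-4):
    """Return True if all generated waveforms have the same length and
    are numerically indistinguishable within `tol`."""
    lengths = {len(a) for a in audios}
    if len(lengths) != 1:
        return False
    ref = audios[0]
    return all(np.max(np.abs(a - ref)) < tol for a in audios[1:])

# Simulated outputs: same length, tiny numerical noise -- they would
# "sound exactly the same to the ear" while still differing under diff.
rng = np.random.default_rng(0)
base = rng.standard_normal(22050)
audios = [base + rng.standard_normal(22050) * 1e-6 for _ in range(3)]
print(outputs_identical(audios))  # True: the reference audio had no effect
```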

So what I did was modify the code responsible for selecting the reference audio in `nemo.collections.tts.data.dataset.TTSDataset.__getitem__`, effectively changing this:

```python
reference_audio = self.featurizer.process(self.data[reference_index]["audio_filepath"], ...)
```

to this:

```python
reference_audio = self.featurizer.process(sample["audio_filepath"], ...)
```

which is exactly what the GST paper prescribes. In this case, I observed that the choice of the reference audio at inference time did make a difference, however small (although this is probably due to the fact that the training set I used was not very varied in terms of speaking styles).
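To make the difference between the two strategies concrete, here is a minimal sketch of the two reference-selection policies. The dataset structure is heavily simplified; the field names mirror the ones quoted above, but this is not the actual `TTSDataset` code.

```python
import random

# Toy dataset: each entry is one <text, ground_truth_audio> training sample.
data = [
    {"audio_filepath": "spk1_utt1.wav", "speaker": 1},
    {"audio_filepath": "spk1_utt2.wav", "speaker": 1},
    {"audio_filepath": "spk2_utt1.wav", "speaker": 2},
]

def reference_nemo(index):
    # NeMo behavior: pick a random sample from the same speaker.
    sample = data[index]
    same_speaker = [d for d in data if d["speaker"] == sample["speaker"]]
    return random.choice(same_speaker)["audio_filepath"]

def reference_gst_paper(index):
    # GST paper: the ground-truth audio is its own reference.
    return data[index]["audio_filepath"]

print(reference_gst_paper(0))  # spk1_utt1.wav
```

Under the paper's policy the style embedding is always computed from the very audio the model is asked to reconstruct, so the model is incentivized to actually use it.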

My intuition is that if an audio A1 is used to compute the style embedding that is then used to condition the generation of the Mel spectrogram for another audio A2, the model will learn to ignore that style embedding because the information that can be extracted from A1 is useless to generate the spectrogram for A2. You could argue that this isn't quite true because A1 and A2 come from the same speaker, hence you could extract speaker information from A1 to generate the spectrogram for A2, but FastPitch already contains a SpeakerLookup module that takes care of encoding speaker information.

Steps/Code to reproduce bug

  1. Train a FastPitch model using GSTs, for example using the following configuration file: examples/tts/conf/fastpitch_align_44100_adapter.yaml
  2. Perform multiple inference steps where the input text is always the same, but the reference audio is different
  3. Generate audios using a vocoder such as HiFi-GAN
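Steps 2 and 3 can be sketched as the loop below. `synthesize` is a placeholder standing in for FastPitch spectrogram generation conditioned on the reference audio followed by HiFi-GAN vocoding; it is not a real NeMo API call.

```python
def synthesize(text, reference_audio):
    # Placeholder: a real run would load FastPitch and HiFi-GAN checkpoints
    # and return a waveform conditioned on `reference_audio`.
    return f"waveform({text!r}, ref={reference_audio})"

references = ["ref_1.wav", "ref_2.wav", "ref_3.wav"]
text = "The same sentence for every run."
outputs = {ref: synthesize(text, ref) for ref in references}

# If GST conditioning works, the waveforms should differ across references;
# the bug reported here is that they come out effectively identical.
print(len(outputs))  # 3
```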

Expected behavior

Choosing different reference audios at inference time results in different audios produced as output.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of NeMo install: pip install

Environment details

  • OS version: Ubuntu 20.04.2
  • PyTorch version: 2.0.1
  • Python version: 3.8

Additional context

GPU model: RTX 3090 Ti

anferico added the bug label on Sep 12, 2023

github-actions (bot) commented:

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions added the stale label on Oct 13, 2023

anferico (Contributor, Author) commented:
Any updates? At this stage, I'm just trying to understand whether the difference in the implementation is intentional or simply a bug.

hsiehjackson (Collaborator) commented:

Solved by PR #7788
