
[TTS] Global Style Tokens implementation in FastPitch doesn't follow the original paper #7420

Closed
anferico opened this issue Sep 12, 2023 · 3 comments
Labels: bug, TTS

anferico (Contributor) commented:

Describe the bug

The implementation of Global Style Tokens (GSTs) in FastPitch introduced in #6417 does not follow the prescription of the original paper. In particular, the difference lies in the choice of the reference audio for a given training/validation sample <text, ground_truth_audio>:

  • In the original paper, it is recommended to use the ground-truth audio as the reference audio
  • In the NeMo implementation, the reference audio is chosen at random among audio samples from the same speaker

I discovered this after training a FastPitch model with GSTs and observing that the choice of the reference audio at inference time would have virtually no impact at all. More precisely, given a text T and n different reference audios R_1, ..., R_n, passing <T, R_i> for any i in [1, n] would result in almost the exact same audio A. I say "almost" because the files appeared different when compared using diff, but they had the exact same length and they sounded exactly the same to the ear.
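A quick way to confirm the symptom described above is to compare the generated waveforms numerically. The sketch below is a hypothetical check (the `outputs_identical` helper and the simulated waveforms are mine, not part of NeMo): it flags a set of outputs as "effectively identical" when they share the same length and differ only by negligible numerical noise, which is exactly what I observed.

```python
import numpy as np

def outputs_identical(audios, tol=1e-4):
    """Return True if all generated waveforms have the same length and
    are numerically indistinguishable within `tol`."""
    lengths = {len(a) for a in audios}
    if len(lengths) != 1:
        return False
    ref = audios[0]
    return all(np.max(np.abs(a - ref)) < tol for a in audios[1:])

# Simulated outputs: same length, tiny numerical noise -- they would
# "sound exactly the same to the ear" while still differing under diff.
rng = np.random.default_rng(0)
base = rng.standard_normal(22050)
audios = [base + rng.standard_normal(22050) * 1e-6 for _ in range(3)]
print(outputs_identical(audios))  # True: the reference audio had no effect
```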

So what I did was modify the code responsible for selecting the reference audio in `nemo.collections.tts.data.dataset.TTSDataset.__getitem__`, effectively changing this:

```python
reference_audio = self.featurizer.process(self.data[reference_index]["audio_filepath"], ...)
```

to this:

```python
reference_audio = self.featurizer.process(sample["audio_filepath"], ...)
```

which is exactly what the GST paper prescribes. In this case, I observed that the choice of the reference audio at inference time did make a difference, however small (although this is probably due to the fact that the training set I used was not very varied in terms of speaking styles).
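To make the difference between the two strategies concrete, here is a minimal sketch of the two reference-selection policies. The dataset structure is heavily simplified; the field names mirror the ones quoted above, but this is not the actual `TTSDataset` code.

```python
import random

# Toy dataset: each entry is one <text, ground_truth_audio> training sample.
data = [
    {"audio_filepath": "spk1_utt1.wav", "speaker": 1},
    {"audio_filepath": "spk1_utt2.wav", "speaker": 1},
    {"audio_filepath": "spk2_utt1.wav", "speaker": 2},
]

def reference_nemo(index):
    # NeMo behavior: pick a random sample from the same speaker.
    sample = data[index]
    same_speaker = [d for d in data if d["speaker"] == sample["speaker"]]
    return random.choice(same_speaker)["audio_filepath"]

def reference_gst_paper(index):
    # GST paper: the ground-truth audio is its own reference.
    return data[index]["audio_filepath"]

print(reference_gst_paper(0))  # spk1_utt1.wav
```

Under the paper's policy the style embedding is always computed from the very audio the model is asked to reconstruct, so the model is incentivized to actually use it.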

My intuition is that if an audio A1 is used to compute the style embedding that is then used to condition the generation of the Mel spectrogram for another audio A2, the model will learn to ignore that style embedding because the information that can be extracted from A1 is useless to generate the spectrogram for A2. You could argue that this isn't quite true because A1 and A2 come from the same speaker, hence you could extract speaker information from A1 to generate the spectrogram for A2, but FastPitch already contains a SpeakerLookup module that takes care of encoding speaker information.

Steps/Code to reproduce bug

  1. Train a FastPitch model using GSTs, for example using the following configuration file: examples/tts/conf/fastpitch_align_44100_adapter.yaml
  2. Perform multiple inference steps where the input text is always the same, but the reference audio is different
  3. Generate audios using a vocoder such as HiFi-GAN
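Steps 2 and 3 can be sketched as the loop below. `synthesize` is a placeholder standing in for FastPitch spectrogram generation conditioned on the reference audio followed by HiFi-GAN vocoding; it is not a real NeMo API call.

```python
def synthesize(text, reference_audio):
    # Placeholder: a real run would load FastPitch and HiFi-GAN checkpoints
    # and return a waveform conditioned on `reference_audio`.
    return f"waveform({text!r}, ref={reference_audio})"

references = ["ref_1.wav", "ref_2.wav", "ref_3.wav"]
text = "The same sentence for every run."
outputs = {ref: synthesize(text, ref) for ref in references}

# If GST conditioning works, the waveforms should differ across references;
# the bug reported here is that they come out effectively identical.
print(len(outputs))  # 3
```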

Expected behavior

Choosing different reference audios at inference time results in different audios produced as output.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of NeMo install: pip install

Environment details

  • OS version: Ubuntu 20.04.2
  • PyTorch version: 2.0.1
  • Python version: 3.8

Additional context

GPU model: RTX 3090 Ti

anferico added the bug label on Sep 12, 2023

github-actions (bot) commented:

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions added the stale label on Oct 13, 2023

anferico (Contributor, Author) commented:
Any updates? At this stage, I'm just trying to understand whether the difference in the implementation is intentional or simply a bug.

hsiehjackson (Collaborator) commented:

Solved by PR #7788
