
Multiple Phonemizer Support #17

Open
kbickar opened this issue Oct 17, 2023 · 12 comments

Comments

@kbickar

kbickar commented Oct 17, 2023

The piper-phonemize setup is a bit confusing at the moment, as it's both included with some significant code and a library imported at runtime. The two phonemizers, text and espeak, are both tightly integrated into piper and piper-phonemize. Furthermore, they are linked with the espeak-ng library, which has the GPL license, meaning piper-phonemize is also under the GPL license (when distributed) and thus piper is under the GPL license as well.

My proposal is this:

  1. Create a standard interface for a phonemizer between piper/piper-phonemize. This could be 3 functions: initialize, phonemize, terminate (a sketch of such an interface follows this list). The initialize call could also pass in configuration data if required.
  2. Have the phonemizer be selectable at startup via a flag instead of from the voice config. I'm not sure technically if there's a reason the phonemes are configured in the voice .json file, but it seems like that's not entirely necessary as long as the phonemes match.
  3. Separate the phonemizers within piper-phonemize into different libraries that are loaded only if the configuration requires it.
    For example, on Linux, to phonemize text into a vector of phonemes using espeak:
        auto libraryHandle = dlopen("phonemizer_espeak.so", RTLD_LAZY);
        auto phonemizeFn = reinterpret_cast<void (*)(const std::string &, std::vector<std::vector<Phoneme>> &)>(
            dlsym(libraryHandle, "phonemize"));
        phonemizeFn(text, phonemes);
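
To make the interface in point 1 concrete, here is a minimal sketch of what such a plugin API could look like. The function names, the config argument, and the Phoneme typedef are illustrative assumptions, not the current piper-phonemize API:

    // phonemizer_plugin.h -- hypothetical plugin interface (sketch only)
    #include <string>
    #include <vector>

    using Phoneme = char32_t;

    extern "C" {
        // Load models/data; the JSON config blob is optional
        bool phonemizer_initialize(const char *configJson);

        // Convert text into per-sentence phoneme sequences
        void phonemizer_phonemize(const std::string &text,
                                  std::vector<std::vector<Phoneme>> &phonemes);

        // Release any resources held by the plugin
        void phonemizer_terminate();
    }

piper would then resolve these three symbols with dlsym (or GetProcAddress on Windows) from whichever library the startup flag selects.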

This would allow an easy way to integrate a new phonemizer without updating both programs, and even allows a new library to be added without updating piper-phonemize. Plus, the dependency on espeak-ng would be optional, which means it could be distributed under the much more permissive MIT license.

I can implement some of the changes to do this, but as it would be a fairly substantial change, I thought it would be best to discuss it first.

@SeymourNickelson

I was going to create a similar issue. Thanks to the author for all the hard work. Really cool project.

Judging by the comment here: OpenVoiceOS/ovos-tts-plugin-piper#2 (comment) I think the problem is that different phonemizers generate different IPA characters when phonemizing (because the author said piper models would likely need to be retrained for a new phonemizer).

So if another phonemizer generates a sequence of IPA characters the current models aren't trained on, speech synthesis isn't going to work. There is a function in this repo, phonemes_to_ids, which will pass back "missing phonemes" if you feed piper phonemized text it doesn't understand (which an alternative phonemizer may or may not generate). I don't think the current phonemizer supports every IPA character, so it's likely that just swapping in a new phonemizer isn't so easy.
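
A rough sketch of the kind of lookup being described, assuming a plain phoneme-to-id map; the names below are illustrative and not the actual piper-phonemize API:

    // Sketch: map phonemes to model ids, collecting anything the voice wasn't trained on
    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <vector>

    using Phoneme = char32_t;
    using PhonemeId = int64_t;

    std::vector<PhonemeId> phonemesToIds(const std::vector<Phoneme> &phonemes,
                                         const std::map<Phoneme, PhonemeId> &idMap,
                                         std::map<Phoneme, std::size_t> &missing) {
        std::vector<PhonemeId> ids;
        for (auto phoneme : phonemes) {
            auto it = idMap.find(phoneme);
            if (it != idMap.end()) {
                ids.push_back(it->second);
            } else {
                missing[phoneme] += 1; // phoneme the model has no id for
            }
        }
        return ids;
    }

Any entries left in the missing map are phonemes the voice model simply has no id for, which is why a drop-in phonemizer with a different inventory tends to fail.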

Ideally, if there were another phonemizer out there that restricts itself to outputting only the IPA characters espeak-ng currently uses, it would be backward compatible with the already trained models. As long as the alternative phonemizer generates IPA characters that map to ids the piper models understand, it should work. I don't think this would be a GPL issue as long as the new phonemizer uses its own algorithm to phonemize (it wouldn't be a derivative of espeak-ng). I don't think you can GPL the alphabet.

@synesthesiam
Contributor

I'm considering two alternatives to espeak-ng to avoid licensing issues:

  1. Using text phonemes with byte-pair encoding (BPE), possibly with [pre-trained sentencepiece models](https://github.com/bheinzerling/bpemb)
  2. Reviving the gruut project and porting it to C++

In both cases, I expect that all of the voices will need to be retrained.

For option 1, I don't think training from a base English voice will work as well anymore because of differing character sets. Option 2 will have limited language support, and the licensing on the phoneme dictionaries is completely unknown (many have been floating around the internet for years without attribution).
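
For option 1, a rough idea of what encoding input text into BPE subword ids could look like, using the sentencepiece C++ library with one of the bpemb pre-trained models (the model filename here is just an example, and this is not something piper does today):

    // Sketch: text -> BPE subword ids with sentencepiece (model filename is an example)
    #include <sentencepiece_processor.h>
    #include <iostream>
    #include <vector>

    int main() {
        sentencepiece::SentencePieceProcessor sp;
        if (!sp.Load("en.wiki.bpe.vs10000.model").ok()) {
            return 1; // pre-trained bpemb model not found
        }
        std::vector<int> ids;
        sp.Encode("Welcome to the world of speech synthesis!", &ids);
        for (int id : ids) {
            std::cout << id << " ";
        }
        std::cout << std::endl;
        return 0;
    }

The voice model would then consume these subword ids instead of phoneme ids, which is why the existing voices would need retraining.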

Here's another question, likely with no answer: if I were to implement my own (clean room) copy of eSpeak's phonemization rule engine, would the dictionary data files be usable without a GPL license? I see 3 other licenses in the espeak-ng repo (BSD, Apache 2, UCD), so I have no idea what applies to the source code vs. data files.

@kbickar
Author

kbickar commented Dec 20, 2023

The espeak library is pretty good at its job and doesn't necessarily need to be replaced; it just needs to be less tightly coupled to piper so someone could swap it out with a different library if they wanted.

Then the phonemizer could be espeak or gruut or sequitur or whatever. Making a new phonemizer is a big endeavor and there's no need to re-invent the wheel.

> Here's another question, likely with no answer: if I were to implement my own (clean room) copy of eSpeak's phonemization rule engine, would the dictionary data files be usable without a GPL license? I see 3 other licenses in the espeak-ng repo (BSD, Apache 2, UCD), so I have no idea what applies to the source code vs. data files.

The rules files at least have the GPLv3 license at the top, I imagine the dictionary would as well, but it's not TOO difficult to find dictionary files.

@SeymourNickelson

SeymourNickelson commented Dec 20, 2023

The phonemizer appears to be tightly coupled to piper because the voice models piper uses understand the phonemes espeak produces. There isn't a universal way to phonemize. As the author said he expects that all the existing voice models would need to be retrained for a different phonemizer. If you have to train a new voice model per phonemizer that isn't going to scale.

I tried swapping in a different phonemizer, but it phonemizes in a different way than espeak; it uses some phonemes that espeak doesn't use and vice versa. I think I can remap some of the phonemes in the replacement phonemizer to equivalent ones the model understands to mitigate this, but it looks like it is going to be a bit hairy.
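
The remapping described here could be as simple as a lookup table applied to the replacement phonemizer's output before the phonemes are turned into ids; the function and type names are illustrative, and the actual equivalences would have to be worked out per phoneme:

    // Sketch: remap phonemes from an alternative phonemizer onto the espeak-style
    // symbols the voice model was trained on; unknown phonemes pass through unchanged
    #include <map>
    #include <vector>

    using Phoneme = char32_t;

    std::vector<Phoneme> remapPhonemes(const std::vector<Phoneme> &input,
                                       const std::map<Phoneme, Phoneme> &remap) {
        std::vector<Phoneme> output;
        output.reserve(input.size());
        for (auto phoneme : input) {
            auto it = remap.find(phoneme);
            output.push_back(it != remap.end() ? it->second : phoneme);
        }
        return output;
    }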

A less sophisticated yet still complicated approach is to build a phonemizer using all the same phonemes as espeak (no more and no less). The IPA characters themselves can't be GPL'd. If you could GPL the alphabet all written text would be considered a derivative.

I don't think you can GPL a map table: ["Apple" : 🍎]
but the dictionary data files seem to be a bit more than that? I would think using them directly would probably require licensing the new phonemizer under the GPL, so they would probably be better to avoid.

A phonemizer that outputs the same IPA characters would be backward compatible, though perhaps constraining oneself to use only the phonemes that espeak does would feel too restricting. That would perhaps be one of the tradeoffs for trying to be a "swap in" replacement for espeak.

@synesthesiam
Contributor

synesthesiam commented Dec 20, 2023

@SeymourNickelson I actually did train an "eSpeak compatible" phonemizer in gruut; there are separate database files for that. It works OK, but espeak-ng is a bit more sophisticated than you might expect. It handles some part-of-speech dependent pronunciation rules (for English at least) like "I read a book yesterday" vs. "I read books often". Additionally, it's able to break apart words somewhat intelligently: like pronouncing a username hansenm as "Hansen M".

@kbickar I don't want to reinvent the wheel, but the licensing question comes up quite frequently. Similarly, using the Lessac voice as a base adds more questions when people want to use Piper commercially. While I sympathize with the GPL philosophy, I prefer to keep my stuff MIT/public domain. And if I'm going to suggest people contribute pronunciation fixes, etc., it makes more sense to do it for a project with fewer restrictions.

@kbickar
Author

kbickar commented Dec 20, 2023

At least the ability to train a new base model from scratch is relatively straightforward, so a model without the Lessac dataset can be created and used with piper out of the box.

Some sort of plugin interface would be great

@SeymourNickelson

SeymourNickelson commented Dec 21, 2023

> @SeymourNickelson I actually did train an "eSpeak compatible" phonemizer in gruut; there are separate database files for that. It works OK, but espeak-ng is a bit more sophisticated than you might expect. It handles some part-of-speech dependent pronunciation rules (for English at least) like "I read a book yesterday" vs. "I read books often". Additionally, it's able to break apart words somewhat intelligently: like pronouncing a username hansenm as "Hansen M".

Cool! I'll have to check out Gruut. It seems eSpeak tries to go the extra mile phonemizing (I haven't looked at the internals) but it definitely doesn't handle everything perfectly either. In my testing it didn't handle "I read a book yesterday" properly. I wonder if there is a good open source "part of speech tagger" out there that input text could be fed to first before phonemizing, which could be used to disambiguate those words pronounced differently in different contexts.

Unfortunately for me I'm not working in Python so I'd have to port Gruut to my native programming language (which isn't C++ either, although a C++ version would be more accessible for my target platform). Might be worth it. I just did this (ported from Python) with another phonemizer but unfortunately that one phonemizes every word independently and not in the context of surrounding words; it doesn't try to handle some of these complex pronunciation rules you mention.

The supported language list of Gruut would be enough for me, so if you did port that to C++ for Piper at some point, maybe those needing a phonemizer in another language could fall back to espeak.

@JarbasAl

JarbasAl commented Dec 21, 2023

From a dev POV, I would like to see gruut as an option, and honestly would love to see a C++ incarnation that is continuously updated. Just like this project is now tackling license issues instead of focusing on the code, the same will happen to future projects that use espeak (I expect that to not be uncommon, due to the lack of alternatives). A permissively licensed phonemizer to replace espeak would benefit the whole voice ecosystem and help future devs and projects avoid this same issue.

Let's assume gruut voices sound worse than espeak voices. From a user POV, it would be nice if piper supported both gruut and espeak voices: just making espeak optional makes piper GPL-free, and then using a voice that needs espeak will drag in the GPL license, but that is voice specific rather than library specific. Users can use whatever voice sounds best to them, espeak or gruut based; a user won't care about GPL.

I understand this means at least double the work, without even counting the time to port gruut to C++. Totally understandable if it's not feasible, but I wanted to leave my 2 cents.

@JarbasAl

> 1. Using text phonemes with byte-pair encoding (BPE), possibly with pre-trained sentencepiece models ([bheinzerling/bpemb](https://github.com/bheinzerling/bpemb))

I recently came across this paper https://assets.amazon.science/25/ae/5d36cc3843d1b906647b6b528c1b/phonetically-induced-subwords-for-end-to-end-speech-recognition.pdf and I previously also played around with this repo https://github.com/hainan-xv/PASM

This is a bit over my area of expertise, but you should be able to understand the nuances better and judge if it's applicable or useful.

Quote:

> Closer to our approach is the Pronunciation Assisted Subword Modelling (PASM) that was shown to outperform BPE and single character baselines [27]. Subword generation in PASM is based on consistent alignments between single phonemes and single characters. A downside of this approach is that it tends to choose short subwords and avoids modelling full words with single tokens. As a consequence, subword variability is limited and, along with the method's exclusion criteria, the resulting vocabularies are relatively small (around 100 and 200 subwords for WSJ and Librispeech respectively). We compare our results to PASM in our 200 subword experiments.

Apologies if this is irrelevant, but since you mentioned BPE I thought it could be helpful.

@lumpidu

lumpidu commented Jan 13, 2024

Not directly related, but maybe relevant to the adoption of general-purpose phonemizers.

I did integrate our Icelandic phonemizer as an alternative to eSpeak into piper directly, because the Icelandic version of eSpeak uses an old IPA symbol set and additionally the normalization is not very good for Icelandic. E.g. homographs, dates and numbers, I am looking at you ...

The integration was not really difficult and took me half a day or so, because our pipeline is also Python-based: see https://github.com/grammatek/ice-g2p for the phonemizer. I did, however, change the symbols in Piper and only used those of our alphabet. As I am training from scratch and don't want to fine-tune any existing model, that's probably ok.

We are using X-SAMPA by default in our grammars and symbols, but remapping this to IPA is just a lookup. See https://github.com/grammatek/ice-g2p/blob/master/src/ice_g2p/data/sampa_ipa_single_flite.csv
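
Since the X-SAMPA to IPA conversion is described as just a lookup, a sketch of loading such a two-column CSV into a map might look like this (the `symbol,ipa` column layout is assumed here, not verified against the linked file):

    // Sketch: load an X-SAMPA -> IPA mapping from a two-column CSV
    #include <fstream>
    #include <map>
    #include <string>

    std::map<std::string, std::string> loadSampaToIpa(const std::string &csvPath) {
        std::map<std::string, std::string> mapping;
        std::ifstream file(csvPath);
        std::string line;
        while (std::getline(file, line)) {
            auto comma = line.find(',');
            if (comma == std::string::npos) {
                continue; // skip malformed lines
            }
            mapping[line.substr(0, comma)] = line.substr(comma + 1);
        }
        return mapping;
    }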

We also have an Android app that uses the C++ library Thrax for G2P: https://github.com/grammatek/g2p-thrax. Thrax is totally rule-based and does not perform as well as the other G2P module, but it is good enough for most purposes.

The former uses a BiLSTM for the conversion, which is pretty good. But Icelandic pronunciation is also very regular, and only homographs need to be treated specially.

What we do additionally is use a very big G2P dictionary to speed up our inference time. This just needs to be processed offline once in a while, and then you can use it efficiently at runtime. If you process a large enough corpus of a specific language, you will get very good coverage for most words. And homographs can be chosen dynamically depending on some rules/model instead.
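
The lookup-first approach described above boils down to checking the precomputed dictionary and only running the (slower) G2P model for out-of-vocabulary words; the G2P callback below is a placeholder for whichever model is used:

    // Sketch: precomputed G2P dictionary with a runtime fallback for unknown words
    #include <functional>
    #include <string>
    #include <unordered_map>

    std::string phonemizeWord(const std::string &word,
                              const std::unordered_map<std::string, std::string> &dictionary,
                              const std::function<std::string(const std::string &)> &g2pFallback) {
        auto it = dictionary.find(word);
        if (it != dictionary.end()) {
            return it->second; // fast path: phonemized offline
        }
        return g2pFallback(word); // slow path: run the G2P model at runtime
    }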

What we found absolutely necessary for text normalization is to use a PoS tagger. We also trained a BiLSTM-based PoS model within our Icelandic language technology program, but there are some other alternatives available. For Python, e.g. StanfordNLP Stanza with the Apache 2.0 license.

@SeymourNickelson

@lumpidu Thanks a lot for sharing. Very informative.

I integrated another phonemizer in Python to use in the Piper training script (basically followed the training guide). The only dependency I didn't install is piper-phonemize: I just stubbed in my own Python module that returned the expected data for preprocessing (shimmed in all the values from the replacement phonemizer).

Because this phonemizer uses different symbols than espeak, I also need to train from scratch. Do you mind sharing what hardware you are training on? I can't get piper to train on the GPU on my Apple hardware (and I'm not sure, even if I could get training working on the GPU, whether it would be fast enough). Google Colab keeps throwing me off my training session before I can finish, even though I still have compute credits. Colab feels like a weird system: just throw a paying customer out in the middle of a training session at any time, no questions asked; delete all the data and keep the money!

@lumpidu

lumpidu commented Jan 13, 2024

@SeymourNickelson: sure, we use our own compute hardware, a Ryzen Threadripper Pro workstation with 32 cores, 512 GB RAM, lots of SSDs, and 2x A6000 Nvidia cards. There is also a 3090 card inside that I mostly use for inferencing. I am currently training an xs model (with a smaller parameter size but 22.05 kHz files) on my 2x A6000 cards. This model is meant for inferencing on the Android phone. Training runs smoothly, now at a bit more than 1500 epochs after almost 2 days overall, i.e. ~110 seconds/epoch with a >17,000-file dataset. Because these cards have 48 GB RAM, I use a batch_size of 64 and a symbol_size of 600, and still the memory is not even half filled.
I have no experience with Google Colab. We decided against using cloud GPUs one year ago, and owning a dedicated GPU workstation has a lot of pros that I don't want to miss. OTOH, I would use cloud GPUs from some of the usual suspects for trainings that need longer than, say, 2 weeks.
