Multiple Phonemizer Support #17
Comments
I was going to create a similar issue. Thanks to the author for all the hard work; really cool project.

Judging by the comment here: OpenVoiceOS/ovos-tts-plugin-piper#2 (comment), I think the problem is that different phonemizers generate different IPA characters when phonemizing (the author said Piper models would likely need to be retrained for a new phonemizer). So if another phonemizer generates a sequence of IPA characters the current models weren't trained on, speech synthesis isn't going to work. There is a function in this repo, phonemes_to_ids, which will pass back "missing phonemes" if you feed Piper phonemized text it doesn't understand (which an alternative phonemizer may or may not generate).

I don't think the current phonemizer supports every IPA character, so it's likely that just swapping in a new phonemizer isn't so easy. Ideally, if there were another phonemizer out there that restricted its output to only the IPA characters espeak-ng currently uses, it would be backward compatible with the already trained models. As long as the alternative phonemizer generates IPA characters that map to ids the Piper models understand, it should work.

I don't think this would be a GPL issue as long as the new phonemizer uses its own algorithm to phonemize (it wouldn't be a derivative of espeak-ng). I don't think you can GPL the alphabet.
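The "missing phonemes" behavior described above can be sketched roughly as follows. The id table here is a toy example, not Piper's real symbol set, and the function is a simplified stand-in for the actual phonemes_to_ids:

```python
# Minimal sketch of phoneme-to-id conversion that reports unknown
# phonemes. The id table is illustrative only, not Piper's real one.
PHONEME_ID_MAP = {"_": 0, "h": 1, "ə": 2, "l": 3, "oʊ": 4}

def phonemes_to_ids(phonemes):
    ids, missing = [], []
    for p in phonemes:
        if p in PHONEME_ID_MAP:
            ids.append(PHONEME_ID_MAP[p])
        else:
            # The model was never trained on this symbol, so
            # synthesis of it cannot work; report it back instead.
            missing.append(p)
    return ids, missing

# "ʘ" (a click consonant) is not in the table, so it comes back missing.
ids, missing = phonemes_to_ids(["h", "ə", "l", "oʊ", "ʘ"])
```

Any phonemizer whose output stays inside the trained id table would leave `missing` empty, which is the backward-compatibility condition discussed above.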
I'm considering two alternatives to espeak-ng to avoid licensing issues:
In both cases, I expect that all of the voices will need to be retrained. For option 1, I don't think training from a base English voice will work as well anymore because of differing character sets. Option 2 will have limited language support, and the licensing on the phoneme dictionaries is completely unknown (many have been floating around the internet for years without attribution).

Here's another question, likely with no answer: if I were to implement my own (clean room) copy of eSpeak's phonemization rule engine, would the dictionary data files be usable without a GPL license? I see 3 other licenses in the espeak-ng repo (BSD, Apache 2, UCD), so I have no idea what applies to the source code vs. the data files.
The espeak library is pretty good at its job and doesn't necessarily need to be replaced; it just needs to be less tightly coupled to Piper so someone could swap in a different library if they wanted. Then the phonemizer could be espeak, gruut, sequitur, or whatever. Making a new phonemizer is a big endeavor, and there's no need to reinvent the wheel.
The rules files at least have the GPLv3 license at the top, and I imagine the dictionary would as well, but it's not too difficult to find dictionary files.
The phonemizer appears to be tightly coupled to Piper because the voice models Piper uses understand the phonemes espeak produces. There isn't a universal way to phonemize. As the author said, he expects that all the existing voice models would need to be retrained for a different phonemizer. If you have to train a new voice model per phonemizer, that isn't going to scale.

I tried swapping in a different phonemizer, but it phonemizes in a different way than espeak; it uses some phonemes that espeak doesn't use and vice versa. I think I can remap some of the phonemes in the replacement phonemizer to equivalent ones the model understands to mitigate this, but it looks like it is going to be a bit hairy. A less sophisticated yet still complicated approach is to build a phonemizer using all the same phonemes as espeak (no more and no less).

The IPA characters themselves can't be GPL'd. If you could GPL the alphabet, all written text would be considered a derivative. I don't think you can GPL a map table: ["Apple" : 🍎]. A phonemizer that outputs the same IPA characters would be backward compatible, though perhaps constraining oneself to use only the phonemes that espeak does would feel too restrictive. That would be one of the tradeoffs of trying to be a "swap-in" replacement for espeak.
@SeymourNickelson I actually did train an "eSpeak compatible" phonemizer in gruut; there are separate database files for that. It works OK, but espeak-ng is a bit more sophisticated than you might expect. It handles some part-of-speech-dependent pronunciation rules (for English at least), like "I read a book yesterday" vs. "I read books often". Additionally, it's able to break apart words somewhat intelligently, like pronouncing a username.

@kbickar I don't want to reinvent the wheel, but the licensing question comes up quite frequently. Similarly, using the Lessac voice as a base adds more questions when people want to use Piper commercially. While I sympathize with the GPL philosophy, I prefer to keep my stuff MIT/public domain. And if I'm going to suggest people contribute pronunciation fixes, etc., it makes more sense to do it for a project with fewer restrictions.
At least the ability to train a new base model from scratch is relatively straightforward, so a model without the Lessac dataset can be created and used with Piper out of the box. Some sort of plugin interface would be great.
Cool! I'll have to check out gruut. It seems eSpeak tries to go the extra mile when phonemizing (I haven't looked at the internals), but it definitely doesn't handle everything perfectly either; in my testing it didn't handle "I read a book yesterday" properly. I wonder if there is a good open source part-of-speech tagger out there that input text could be fed through before phonemizing, which could be used to disambiguate words pronounced differently in different contexts.

Unfortunately for me, I'm not working in Python, so I'd have to port gruut to my native programming language (which isn't C++ either, although a C++ version would be more accessible for my target platform). Might be worth it. I just did this (ported from Python) with another phonemizer, but unfortunately that one phonemizes every word independently and not in the context of surrounding words; it doesn't try to handle some of the complex pronunciation rules you mention. The supported language list of gruut would be enough for me, so if you did port it to C++ for Piper at some point, those needing a phonemizer in another language could fall back to espeak.
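The tagger-before-phonemizer idea can be sketched with a toy heteronym table keyed by (word, POS tag). A real pipeline would run an actual tagger (e.g. spaCy or Stanza) first and use its tags; the "VBD"/"VBP" tags below follow Penn Treebank conventions, and the IPA strings are illustrative:

```python
# Toy heteronym lookup: pronunciation depends on the part-of-speech
# tag a tagger assigned upstream. Table contents are illustrative.
HETERONYMS = {
    ("read", "VBD"): "ɹɛd",   # past tense: "I read a book yesterday"
    ("read", "VBP"): "ɹiːd",  # present tense: "I read books often"
}

def pronounce(word, pos_tag, fallback=None):
    """Return the POS-disambiguated pronunciation, or `fallback`
    (e.g. a dictionary/G2P result) when the pair is unknown."""
    return HETERONYMS.get((word.lower(), pos_tag), fallback)
```

The phonemizer itself can then stay context-free; all sentence-level disambiguation lives in the tagger plus this lookup.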
From a dev POV, I would like to see gruut as an option, and honestly would love to see a C++ incarnation that is continuously updated. Just as this project is now tackling license issues instead of focusing on the code, the same will happen to future projects that use espeak (I expect that to not be uncommon, due to the lack of alternatives). A permissively licensed phonemizer to replace espeak would benefit the whole voice ecosystem and help future devs and projects avoid this same issue.

Let's assume gruut voices sound worse than espeak voices. From a user POV, it would be nice if Piper supported both gruut and espeak voices. Just making espeak optional makes Piper itself GPL-free; using a voice that needs espeak will then drag in the GPL license, but that is voice specific rather than library specific. Users can use whatever voice sounds best to them, espeak- or gruut-based; a user won't care about the GPL.

I understand this means at least double the work, without even counting the time to port gruut to C++. Totally understandable if it's not feasible, but I wanted to leave my 2 cents.
I recently came across this paper: https://assets.amazon.science/25/ae/5d36cc3843d1b906647b6b528c1b/phonetically-induced-subwords-for-end-to-end-speech-recognition.pdf and I previously also played around with this repo: https://github.com/hainan-xv/PASM. This is a bit outside my area of expertise, but you should be able to understand the nuances better and judge if it's applicable or useful.

Apologies if this is irrelevant, but since you mentioned BPE I thought it could be helpful.
Not directly related, except maybe in relation to adopting general-purpose phonemizers: I did integrate our Icelandic phonemizer as an alternative to eSpeak into Piper directly, because the Icelandic version of eSpeak uses an old IPA symbol set and, additionally, the normalization is not very good for Icelandic (homographs, dates, and numbers, I am looking at you ...). The integration was not really difficult and took me half a day or so, because our pipeline is also Python-based: see https://github.com/grammatek/ice-g2p for the phonemizer. I changed the symbols in Piper, however, and only used those of our alphabet. As I am training from scratch and don't want to fine-tune any existing model, that's probably OK. We use X-SAMPA by default in our grammars and symbols, but remapping this to IPA is just a lookup; see https://github.com/grammatek/ice-g2p/blob/master/src/ice_g2p/data/sampa_ipa_single_flite.csv

We also have an Android app that uses the C++ library Thrax for G2P: https://github.com/grammatek/g2p-thrax. Thrax is totally rule-based and does not perform as well as the other G2P module, but it's good enough for most purposes. The ice-g2p module uses a BiLSTM for the conversion, which is pretty good. But Icelandic is also very regular in pronunciation, and only homographs need to be treated specially.

What we do additionally is use a very big G2P dictionary to speed up our inference time. This just needs to be processed once in a while offline, and then you can use it efficiently at runtime. If you process a large enough corpus of a specific language, you will get very good coverage for most words. And homographs can be chosen dynamically depending on some rules/model instead.

What we found absolutely necessary for text normalization is a PoS tagger. We have also trained a BiLSTM-based PoS model within our Icelandic language technology program, but there are other alternatives available as well; for Python, e.g. StanfordNLP Stanza with its Apache 2.0 license.
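The "X-SAMPA to IPA is just a lookup" step could be sketched as loading a two-column CSV into a dict. The sample rows below are my own illustration; I have not verified the exact layout of the linked sampa_ipa_single_flite.csv:

```python
import csv
import io

# Assumed two-column format: X-SAMPA symbol, IPA symbol per row.
# These rows are illustrative, not copied from the ice-g2p data file.
SAMPA_CSV = """a,a
ai,ai
O,ɔ
th,θ
"""

def load_sampa_to_ipa(csv_text):
    """Build an X-SAMPA -> IPA lookup table from CSV text."""
    return {row[0]: row[1] for row in csv.reader(io.StringIO(csv_text)) if row}

def sampa_to_ipa(symbols, table):
    """Map each symbol; pass unknown symbols through unchanged."""
    return [table.get(s, s) for s in symbols]

table = load_sampa_to_ipa(SAMPA_CSV)
```

The same dict pattern extends naturally to the big precomputed G2P dictionary mentioned above: one offline build, then O(1) lookups at runtime.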
@lumpidu Thanks a lot for sharing; very informative. I integrated another phonemizer in Python to use in the Piper training script (basically followed the training guide). The only dependency I didn't install is piper-phonemize: I just stubbed in my own Python module that returned the expected data for preprocessing (shimmed in all the values from the replacement phonemizer). Because this phonemizer uses different symbols than espeak, I also need to train from scratch.

Do you mind sharing what hardware you are training on? I can't get Piper to train on the GPU on my Apple hardware (and I'm not sure, even if I could get it training on the GPU, whether it would be fast enough). Google Colab keeps throwing me off my training session before I can finish, even though I still have compute credits. Colab feels like a weird system: just throw a paying customer out in the middle of a training session at any time, no questions asked; delete all the data and keep the money!
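The stub-module approach described here might look something like the following. The function names are placeholders; in practice they would have to match whatever the training script actually imports from piper-phonemize:

```python
# Hypothetical stand-in for the piper-phonemize module during training
# preprocessing. Function names and signatures are assumptions, chosen
# for illustration; align them with the real training script's imports.

def phonemize(text):
    # Delegate to the replacement phonemizer here. As a placeholder,
    # this stub just treats each character as one "phoneme".
    return list(text)

def phonemes_to_ids(phonemes, id_map):
    # Return model input ids, silently skipping unknown symbols.
    return [id_map[p] for p in phonemes if p in id_map]
```

Dropping a module like this on the import path lets preprocessing run without the espeak-ng-linked dependency, at the cost of having to retrain from scratch on the new symbol set, as noted above.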
@SeymourNickelson: sure, we use our own compute hardware, a Ryzen Threadripper Pro workstation with 32 cores, 512 GB RAM, lots of SSDs, and 2x A6000 Nvidia cards. There is also a 3090 card inside that I mostly use for inference. I am currently training an xs model (with a smaller parameter size but 22.05 kHz files) on the 2x A6000 cards. This model is meant for inference on an Android phone. Training runs smoothly, now at a bit more than 1500 epochs after almost 2 days overall, i.e. ~110 seconds/epoch with a dataset of more than 17,000 files. Because these cards have 48 GB RAM, I use a batch_size of 64 and a symbol_size of 600, and still the memory is not even halfway filled.
The piper-phonemize setup is a bit confusing at the moment, as it's both included in Piper with some significant code and imported as a library at runtime. The two phonemizers, text and espeak, are tightly integrated into both piper and piper-phonemize. Furthermore, they are linked with the espeak-ng library, which carries the GPL license, meaning piper-phonemize is also under the GPL (when distributed) and thus so is piper.
My proposal is this:
For example on Linux to phonemize text into a vector of phonemes using espeak:
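The original snippet for this example appears to have been lost in the page capture. As a rough stand-in, runtime loading of a phonemizer shared library on Linux could be sketched in Python with ctypes (dlopen under the hood); the library name and the `phonemize` symbol, taking a UTF-8 string and returning space-separated phonemes, are my assumptions, not an existing API:

```python
import ctypes

def load_phonemizer(lib_path):
    """Dynamically load a phonemizer shared library and bind a
    hypothetical C entry point:
        const char* phonemize(const char* text);
    Both the library name and the symbol are illustrative."""
    lib = ctypes.CDLL(lib_path)
    lib.phonemize.argtypes = [ctypes.c_char_p]
    lib.phonemize.restype = ctypes.c_char_p
    return lib

# Usage, assuming such a library existed:
#   lib = load_phonemizer("libespeak_phonemizer.so")
#   phonemes = lib.phonemize(b"hello world").decode("utf-8").split()
```

Because the library is resolved only at runtime, a build of piper without any GPL phonemizer installed would never link against espeak-ng, which is the core of the proposal.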
This would allow an easy way to integrate a new phonemizer without updating both programs, and would even allow a new library to be added without updating piper-phonemize. Plus, the dependency on espeak-ng would become optional, which means Piper could be distributed under the much more permissive MIT license.
I can implement some of the changes to do this, but as it would be a fairly substantial change, I thought it would be best to discuss it first.