FYI: Run models from piper with the Next-gen Kaldi subproject sherpa-onnx #251

csukuangfj · 2023-10-26T07:44:29Z

FYI: We have supported piper models in
https://github.com/k2-fsa/sherpa-onnx

Note that it does not depend on https://github.com/rhasspy/piper-phonemize

sherpa-onnx supports a variety of platforms, such as

Windows (x86, x64)
Linux (x64, arm, arm64), e.g., Raspberry Pi
macOS (x64, arm64)

It also provides APIs for various programming languages, e.g., C/C++/Python/Kotlin/Swift/C#/Go. We also have Android APKs for TTS.

You can find the installation doc at https://k2-fsa.github.io/sherpa/onnx/install/index.html

You can find the usage of piper models with sherpa-onnx at
https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/vits.html#lessac-blizzard2013-medium-english-single-speaker

We also have a huggingface space for you to try piper models with sherpa-onnx.
Please visit
https://huggingface.co/spaces/k2-fsa/text-to-speech

You can find the PR supporting piper in sherpa-onnx at k2-fsa/sherpa-onnx#390

mush42 · 2023-10-26T16:17:03Z

@csukuangfj where to find the Android APKs?

beqabeqa473 · 2023-10-28T10:23:25Z

@csukuangfj Yes, it would be good to know about android tts as well. Could you please tell where to get it?

csukuangfj · 2023-10-29T09:49:13Z

I'm sorry for not getting back to you sooner.

I have been working on converting more models from piper.

Now all models of the following languages have been converted to sherpa-onnx:

English (both US and GB)
French
German
Spanish (both ES and MX)

You can find the Android APKs on the following page.
https://k2-fsa.github.io/sherpa/onnx/tts/apk.html

beqabeqa473 · 2023-10-29T09:58:26Z

Are they using the standard Android text-to-speech API or not?


csukuangfj · 2023-10-29T10:06:09Z

> Are they using the standard Android text-to-speech API or not?

@beqabeqa473

No, it uses sherpa-onnx with vits pre-trained models for tts.

Everything is open-sourced. You can find the source code for the android project at
https://github.com/k2-fsa/sherpa-onnx/tree/master/android/SherpaOnnxTts

The underlying C++ code can be found at https://github.com/k2-fsa/sherpa-onnx

The JNI C++ binding code can be found at
https://github.com/k2-fsa/sherpa-onnx/tree/master/sherpa-onnx/jni

You can find kotlin API examples at
https://github.com/k2-fsa/sherpa-onnx/tree/master/kotlin-api-examples

beqabeqa473 · 2023-10-29T10:16:42Z

Aah, ok, I meant standard TTS-engine API bindings. I may try to do that at some point, so this TTS can be used as a standard Android TTS engine, for example with screen readers.


synesthesiam · 2023-10-29T15:26:05Z

Thanks for doing this @csukuangfj! I'd looked into sherpa-onnx at one point, but wasn't sure how to proceed. I'd like to link to your work when you think it's stable enough; I do want to make sure people understand that pronunciations may be slightly different due to the pre-computed lexicon.

Speaking of the lexicon, could it be extended dynamically at runtime with your approach?

csukuangfj · 2023-10-29T15:53:58Z

@synesthesiam

> but wasn't sure how to proceed.

We have detailed documentation at
https://k2-fsa.github.io/sherpa/onnx/

Could you tell us what you want to do? We can clarify the doc if you think it is not clear.

> I do want to make sure people understand that pronunciations may be slightly different due to the pre-computed lexicon.

The lexicon.txt is generated by following the colab notebook from this repo
https://github.com/rhasspy/piper/blob/master/notebooks/piper_inference_(ONNX).ipynb

The exact code can be found at
https://github.com/csukuangfj/models/tree/master/.github/scripts

Could you explain where the difference comes from?

> Speaking of the lexicon, could it be extended dynamically at runtime with your approach?

No, it cannot. If there is an out-of-vocabulary (OOV) word at runtime, it is simply ignored, though a message is printed to tell the user
that an OOV word has been ignored.
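
To illustrate the behavior described above, here is a minimal Python sketch of a fixed-lexicon lookup that skips OOV words and prints a notice. It is not the sherpa-onnx implementation; the lexicon contents and the function name `words_to_tokens` are made up for illustration.

```python
def words_to_tokens(text, lexicon):
    """Map each word to its phoneme tokens; skip words missing from the lexicon."""
    tokens = []
    for word in text.lower().split():
        if word not in lexicon:
            # Report the OOV word and move on, as described in the reply above.
            print(f"Ignoring OOV word: {word}")
            continue
        tokens.extend(lexicon[word])
    return tokens


# Hypothetical two-entry lexicon, for illustration only.
lexicon = {
    "hello": ["h", "ə", "l", "oʊ"],
    "world": ["w", "ɜː", "l", "d"],
}
print(words_to_tokens("Hello brave world", lexicon))
```

Because the lexicon is a static table, a word such as "brave" contributes nothing to the token sequence, so it is silently dropped from the synthesized audio apart from the printed notice.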

> I'd like to link to your work when you think it's stable enough;

Thank you! I think the support for offline VITS models is stable now. (The APIs for the VITS model are quite simple and
there should be no big changes to the APIs in the near future)

synesthesiam · 2023-11-01T21:41:50Z

> Could you tell us what you want to do? We can clarify the doc if you think it is not clear.

I meant more "big picture" in how I should proceed. I wasn't sure if it was worth investigating porting Piper to sherpa-onnx. I'd be curious if you've noticed any speed difference.

csukuangfj · 2023-11-26T15:47:55Z

> Thanks for doing this @csukuangfj! I'd looked into sherpa-onnx at one point, but wasn't sure how to proceed. I'd like to link to your work when you think it's stable enough; I do want to make sure people understand that pronunciations may be slightly different due to the pre-computed lexicon.
>
> Speaking of the lexicon, could it be extended dynamically at runtime with your approach?

@synesthesiam
I am integrating piper-phonemize so that we can discard lexicon.txt in sherpa-onnx.

Could you have a look at the following two PRs?

csukuangfj · 2023-11-29T14:03:44Z

https://huggingface.co/csukuangfj/vits-piper-pt_PT-tugao-medium/tree/main

I have converted all of the models from piper to sherpa-onnx.
No lexicon.txt is required any more. I am using piper-phonemize.

(Note that you can now run all of the models on Android, iOS, Raspberry Pi, etc.)

anita-smith1 · 2023-12-08T04:03:50Z

@csukuangfj
"No lexicon.txt is required any more. I am using piper-phonemize."

Does this apply to piper models only? Is a lexicon required for coqui tts models? I'm following up on #257.

I couldn't use my coqui tts model converted to sherpa-onnx because I had to manually add words to the lexicon, and pronunciation was poor for single words.

csukuangfj · 2023-12-08T04:18:37Z

> is lexicon required for coqui tts models?

No, it is also not required for coqui tts models.

All vits models for coqui don't use lexicon.txt for sherpa-onnx.

> I couldn't use my coqui tts converted sherpa-onnx model because I had to manually add words to lexicon and there was poor pronunciation for single words.

Please look at just one coqui model at
https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models

For instance, you can look at
https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-coqui-en-ljspeech.tar.bz2

Download it, unzip it, and you will find the code for exporting models from coqui to sherpa-onnx.
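
The archive is a .tar.bz2 file, so it needs tar-style extraction rather than a zip tool. A minimal sketch using only Python's standard library, assuming the file has already been downloaded into the current directory; the helper name `extract_model` is illustrative.

```python
import tarfile
from pathlib import Path


def extract_model(archive_path, dest_dir="."):
    """Extract a .tar.bz2 archive and return the sorted names of its members."""
    with tarfile.open(archive_path, "r:bz2") as tar:
        names = sorted(tar.getnames())
        tar.extractall(dest_dir)
    return names


# Assumed local file name, taken from the download URL above.
archive = Path("vits-coqui-en-ljspeech.tar.bz2")
if archive.exists():
    for name in extract_model(archive):
        print(name)
```

Listing the member names is a quick way to spot the export script mentioned above among the extracted files.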

anita-smith1 · 2023-12-08T14:21:34Z

@csukuangfj Meaning your notebook doesn't work anymore? https://colab.research.google.com/drive/1cI9VzlimS51uAw4uCR-OBeSXRPBc4KoK?usp=sharing

csukuangfj · 2023-12-08T14:45:17Z

> @csukuangfj meaning your notebook doesn't work anymore ? https://colab.research.google.com/drive/1cI9VzlimS51uAw4uCR-OBeSXRPBc4KoK?usp=sharing

I just updated the colab notebook. Please reload it.

@anita-smith1

The updated colab notebook is much much simpler than before.

anita-smith1 · 2023-12-08T16:18:37Z

> > @csukuangfj meaning your notebook doesn't work anymore ? https://colab.research.google.com/drive/1cI9VzlimS51uAw4uCR-OBeSXRPBc4KoK?usp=sharing
>
> I just updated the colab notebook. Please reload it.
>
> @anita-smith1
>
> The updated colab notebook is much much simpler than before.

Your colab notebook works for default vits models, but when I use my fine-tuned vits model, which contains words like "orrse" and "atua" (not in the English dictionary), I get the error "Error when reading tokens at Line <PAD> 0. size: 5" when I try to synthesize speech. It seems to be a tokens.txt issue.

The first colab, which used lexicons, worked, but this one does not work with a fine-tuned model containing your own words. How can we solve this issue?

csukuangfj · 2023-12-08T22:16:57Z

Please show your metadata and add
--debug=1 to your command line.

anita-smith1 · 2023-12-09T00:58:06Z

--debug=1

meta_data {'model_type': 'vits', 'comment': 'coqui', 'language': 'English', 'voice': 'en-us', 'has_espeak': 1, 'add_blank': 1, 'blank_id': 3, 'n_speakers': 0, 'use_eos_bos': 0, 'bos_id': 2, 'eos_id': 1, 'sample_rate': 22050}

After adding --debug=1, I get this output:

/project/sherpa-onnx/csrc/parse-options.cc:Read:361 sherpa-onnx-offline-tts --vits-model=./model.onnx --vits-tokens=./tokens.txt --vits-data-dir=./espeak-ng-data --output-filename=./test.wav --debug=1 'orrse wo betumi atua de a fa mobile' 

/project/sherpa-onnx/csrc/offline-tts-vits-model.cc:Init:79 ---vits model---
bos_id=2
use_eos_bos=0
n_speakers=0
blank_id=3
has_espeak=1
voice=en-us
sample_rate=22050
language=English
add_blank=1
comment=coqui
eos_id=1
model_type=vits
----------input names----------
0 input
1 input_lengths
2 scales
----------output names----------
0 output


/project/sherpa-onnx/csrc/piper-phonemize-lexicon.cc:ReadTokens:66 Error when reading tokens at Line <PAD> 0. size: 5
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-13-c8218415962b> in <cell line: 1>()
----> 1 get_ipython().run_cell_magic('shell', '', '\nsherpa-onnx-offline-tts \\\n --vits-model=./model.onnx \\\n --vits-tokens=./tokens.txt \\\n --vits-data-dir=./espeak-ng-data \\\n --output-filename=./test.wav \\\n --debug=1 \\\n "orrse wo betumi atua de a fa mobile"\n')

3 frames
/usr/local/lib/python3.10/dist-packages/google/colab/_system_commands.py in check_returncode(self)
    135   def check_returncode(self):
    136     if self.returncode:
--> 137       raise subprocess.CalledProcessError(
    138           returncode=self.returncode, cmd=self.args, output=self.output
    139       )

CalledProcessError: Command '
sherpa-onnx-offline-tts \
 --vits-model=./model.onnx \
 --vits-tokens=./tokens.txt \
 --vits-data-dir=./espeak-ng-data \
 --output-filename=./test.wav \
 --debug=1 \
 "orrse wo betumi atua de a fa mobile"
' returned non-zero exit status 255.

And this is the content of the generated tokens.txt file:

<PAD> 0
<EOS> 1
<BOS> 2
<BLNK> 3
a 4
b 5
c 6
d 7
e 8
f 9
h 10
i 11
j 12
k 13
l 14
m 15
n 16
o 17
p 18
q 19
r 20
s 21
t 22
u 23
v 24
w 25
x 26
y 27
z 28
æ 29
ç 30
ð 31
ø 32
ħ 33
ŋ 34
œ 35
ǀ 36
ǁ 37
ǂ 38
ǃ 39
ɐ 40
ɑ 41
ɒ 42
ɓ 43
ɔ 44
ɕ 45
ɖ 46
ɗ 47
ɘ 48
ə 49
ɚ 50
ɛ 51
ɜ 52
ɞ 53
ɟ 54
ɠ 55
ɡ 56
ɢ 57
ɣ 58
ɤ 59
ɥ 60
ɦ 61
ɧ 62
ɨ 63
ɪ 64
ɫ 65
ɬ 66
ɭ 67
ɮ 68
ɯ 69
ɰ 70
ɱ 71
ɲ 72
ɳ 73
ɴ 74
ɵ 75
ɶ 76
ɸ 77
ɹ 78
ɺ 79
ɻ 80
ɽ 81
ɾ 82
ʀ 83
ʁ 84
ʂ 85
ʃ 86
ʄ 87
ʈ 88
ʉ 89
ʊ 90
ʋ 91
ʌ 92
ʍ 93
ʎ 94
ʏ 95
ʐ 96
ʑ 97
ʒ 98
ʔ 99
ʕ 100
ʘ 101
ʙ 102
ʛ 103
ʜ 104
ʝ 105
ʟ 106
ʡ 107
ʢ 108
ʲ 109
ˈ 110
ˌ 111
ː 112
ˑ 113
˞ 114
β 115
θ 116
χ 117
ᵻ 118
ⱱ 119
! 120
' 121
( 122
) 123
, 124
- 125
. 126
: 127
; 128
? 129
  130

csukuangfj · 2023-12-09T01:38:43Z

Could you share your config.json?

The English VITS models from coqui use phonemes. All other non-English models from coqui use Characters.
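
One possible reading of the error above, offered as an assumption rather than something confirmed in this thread: the reported `size: 5` matches the five characters of `<PAD>`, which suggests the token loader on this code path expects every symbol to be a single character. The Python sketch below scans tokens.txt-style lines for symbols longer than one character; the function name `find_multichar_tokens` is illustrative.

```python
def find_multichar_tokens(lines):
    """Return (symbol, id) pairs whose symbol is not exactly one character."""
    bad = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the LAST space so that a symbol which is itself a space
        # (for example the line "  130") is parsed correctly.
        symbol, _, token_id = line.rpartition(" ")
        if len(symbol) != 1:
            bad.append((symbol, int(token_id)))
    return bad


sample = ["<PAD> 0", "<EOS> 1", "<BOS> 2", "<BLNK> 3", "a 4", "ˈ 110", "  130"]
for symbol, token_id in find_multichar_tokens(sample):
    print(f"multi-character symbol {symbol!r} with id {token_id}")
```

On the tokens.txt shown above, only the four special symbols at the top would be flagged under this assumption.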

csukuangfj · 2023-12-09T09:01:16Z

From your config.json:

    "characters_class": "TTS.tts.utils.text.characters.IPAPhonemes",

Unfortunately, we don't support models using IPAPhonemes. Only Graphemes and VitsCharacters from coqui-ai/tts
are supported.

You can find all supported models at
https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models

You can find the script for converting the model by unzipping the downloaded file.

anita-smith1 · 2023-12-09T09:18:47Z

@csukuangfj How can I fine-tune my model to support this? I shared the colab notebook I used in my previous message. Can you take a look? Is it possible to change the configuration and re-fine-tune my model? In case that's not possible and I decide to train/fine-tune using piper, do you have a similar colab notebook for converting a piper model to ONNX?

csukuangfj · 2023-12-09T09:27:57Z

Please download a model and extract it; you will find the conversion script inside.

anita-smith1 · 2023-12-10T00:56:15Z

@csukuangfj I have fine-tuned a model with characters_class="TTS.tts.models.vits.VitsCharacters" and I'm able to synthesize now using your colab notebook. It is working :) Thanks a lot. Now I want to try on Android and iOS, but I can see Android uses the old code below. Will it ignore the lexicon file?

fun getOfflineTtsConfig(
    modelDir: String,
    modelName: String,
    lexicon: String,
    dataDir: String,
    ruleFsts: String
): OfflineTtsConfig? {
    return OfflineTtsConfig(
        model = OfflineTtsModelConfig(
            vits = OfflineTtsVitsModelConfig(
                model = "$modelDir/$modelName",
                lexicon = "$modelDir/$lexicon",
                tokens = "$modelDir/tokens.txt",
                dataDir = "$dataDir"
            ),
            numThreads = 2,
            debug = true,
            provider = "cpu",
        ),
        ruleFsts = ruleFsts,
    )
}
csukuangfj · 2023-12-10T01:01:08Z

Please see where and how this function is called.

csukuangfj · 2023-12-10T02:11:18Z

Please see
https://github.com/k2-fsa/sherpa-onnx/blob/master/android/SherpaOnnxTts/app/src/main/java/com/k2fsa/sherpa/onnx/MainActivity.kt#L172

https://github.com/k2-fsa/sherpa-onnx/blob/0f053d80408b70efde3c8a37f5eeed1c5fd7f837/android/SherpaOnnxTts/app/src/main/java/com/k2fsa/sherpa/onnx/MainActivity.kt#L167-L183

        // Example 1:
        // modelDir = "vits-vctk"
        // modelName = "vits-vctk.onnx"
        // lexicon = "lexicon.txt"

        // Example 2:
        // https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models
        // https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-amy-low.tar.bz2
        // modelDir = "vits-piper-en_US-amy-low"
        // modelName = "en_US-amy-low.onnx"
        // dataDir = "vits-piper-en_US-amy-low/espeak-ng-data"

        // Example 3:
        // modelDir = "vits-zh-aishell3"
        // modelName = "vits-aishell3.onnx"
        // ruleFsts = "vits-zh-aishell3/rule.fst"
        // lexicon = "lexicon.txt"

In your case, please use Example 2.

@anita-smith1

anita-smith1 · 2023-12-10T02:44:26Z

@csukuangfj Thanks a lot for your patience. I'm learning a lot as a beginner. I have run the Android app with the version 1.9.3 .so files and it worked, but I had to make some changes to the initAudioTrack() function. It crashed with an invalid audio buffer size:

java.lang.RuntimeException: Unable to start activity ComponentInfo{com.k2fsa.sherpa.onnx/com.k2fsa.sherpa.onnx.MainActivity}: java.lang.IllegalArgumentException: Invalid audio buffer size.
    at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:4184)
    at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:4340)
    at android.app.servertransaction.LaunchActivityItem.execute(LaunchActivityItem.java:101)
    at android.app.servertransaction.TransactionExecutor.executeCallbacks(TransactionExecutor.java:135)
    at android.app.servertransaction.TransactionExecutor.execute(TransactionExecutor.java:95)
    at android.app.ActivityThread$H.handleMessage(ActivityThread.java:2584)
    at android.os.Handler.dispatchMessage(Handler.java:106)
    at android.os.Looper.loopOnce(Looper.java:226)
    at android.os.Looper.loop(Looper.java:313)
    at android.app.ActivityThread.main(ActivityThread.java:8810)
    at java.lang.reflect.Method.invoke(Native Method)
    at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:604)
    at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1067)
Caused by: java.lang.IllegalArgumentException: Invalid audio buffer size.
    at android.media.AudioTrack.audioBuffSizeCheck(AudioTrack.java:1955)
    at android.media.AudioTrack.<init>(AudioTrack.java:810)
    at android.media.AudioTrack.<init>(AudioTrack.java:752)
    at com.k2fsa.sherpa.onnx.MainActivity.initAudioTrack(MainActivity.kt:78)
    at com.k2fsa.sherpa.onnx.MainActivity.onCreate(MainActivity.kt:40)
    at android.app.Activity.performCreate(Activity.java:8657)
    at android.app.Activity.performCreate(Activity.java:8636)
    at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1417)
    at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:4165)
    at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:4340)
    at android.app.servertransaction.LaunchActivityItem.execute(LaunchActivityItem.java:101)
    at android.app.servertransaction.TransactionExecutor.executeCallbacks(TransactionExecutor.java:135)
    at android.app.servertransaction.TransactionExecutor.execute(TransactionExecutor.java:95)
    at android.app.ActivityThread$H.handleMessage(ActivityThread.java:2584)
    at android.os.Handler.dispatchMessage(Handler.java:106)
    at android.os.Looper.loopOnce(Looper.java:226)
    at android.os.Looper.loop(Looper.java:313)
    at android.app.ActivityThread.main(ActivityThread.java:8810)
    at java.lang.reflect.Method.invoke(Native Method)
    at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:604)
    at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1067)

I had to change the original to the version below, which worked, but I'm not sure whether it has any implications:

private fun initAudioTrack() {
        val sampleRate = tts.sampleRate()
        val minBufferSize = AudioTrack.getMinBufferSize(
            sampleRate,
            AudioFormat.CHANNEL_OUT_MONO,
            AudioFormat.ENCODING_PCM_FLOAT
        )

        // Check if getMinBufferSize returned a valid size
        if (minBufferSize == AudioTrack.ERROR || minBufferSize == AudioTrack.ERROR_BAD_VALUE) {
            Log.e(TAG, "Invalid minimum buffer size: $minBufferSize")
            return
        }

        // Ensure buffer size is at least 0.1 seconds of audio or the minimum buffer size, whichever is larger
        val bufLength = max((sampleRate * 0.1).toInt(), minBufferSize)
        Log.i(TAG, "sampleRate: $sampleRate, bufLength: $bufLength")

        val attr = AudioAttributes.Builder()
            .setContentType(AudioAttributes.CONTENT_TYPE_SPEECH)
            .setUsage(AudioAttributes.USAGE_MEDIA)
            .build()

        val format = AudioFormat.Builder()
            .setEncoding(AudioFormat.ENCODING_PCM_FLOAT)
            .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
            .setSampleRate(sampleRate)
            .build()

        try {
            track = AudioTrack(attr, format, bufLength, AudioTrack.MODE_STREAM, AudioManager.AUDIO_SESSION_ID_GENERATE)

            // Check if AudioTrack is initialized properly
            if (track.state != AudioTrack.STATE_INITIALIZED) {
                Log.e(TAG, "AudioTrack initialization failed")
                return
            }

            track.play()
        } catch (e: IllegalArgumentException) {
            Log.e(TAG, "AudioTrack initialization failed: ${e.message}")
        }
    }
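
One likely explanation for the crash, offered as an assumption rather than something confirmed in this thread: Android's AudioTrack rejects a buffer size in bytes that is not a whole number of audio frames. For mono float PCM a frame is 4 bytes, and 0.1 s at 22050 Hz gives 2205, which is not divisible by 4. The Python sketch below only reproduces that arithmetic; the helper names are illustrative and nothing here calls the Android API.

```python
def frame_size_bytes(channels, bytes_per_sample):
    """Size of one audio frame: one sample for every channel."""
    return channels * bytes_per_sample


def is_valid_buffer_size(size_bytes, frame_bytes):
    """Assumed validity rule: positive and a whole number of frames."""
    return size_bytes >= 1 and size_bytes % frame_bytes == 0


def round_up_to_frame(size_bytes, frame_bytes):
    """Smallest multiple of frame_bytes that is >= size_bytes."""
    return -(-size_bytes // frame_bytes) * frame_bytes


sample_rate = 22050  # from the model metadata shown earlier in the thread
frame = frame_size_bytes(channels=1, bytes_per_sample=4)  # mono, 32-bit float
candidate = int(sample_rate * 0.1)  # the 0.1-second value used in the snippet

print(candidate, is_valid_buffer_size(candidate, frame))
print(round_up_to_frame(candidate, frame))
```

Under this reading, taking the maximum with the value from `getMinBufferSize` avoids the crash whenever that minimum is larger than the 0.1-second value and is itself frame-aligned; rounding up to a frame boundary would cover the remaining case.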

csukuangfj · 2023-12-10T03:37:18Z

Thanks! Would you mind making a PR to fix it?

anita-smith1 · 2023-12-10T14:02:41Z

> Thanks! Would you mind making a PR to fix it?

The working code is from ChatGPT. I don't know why it works. I asked it why the app crashed, and it explained the cause and gave a solution. I think you need to first check and confirm it does not cause any other issue before making a PR. For example, in your recent video on Twitter (X), synthesis is very fast but mine is a bit slow, so I'm not sure if that's due to the code. Thanks

csukuangfj · 2023-12-10T14:24:22Z

I just fixed it in the master branch.

I am using a small model in the video. How large is your model?

anita-smith1 · 2023-12-10T14:52:42Z

Okay that's great. Hope you will soon fix the single word pronunciation issue too. My model size is 145MB

csukuangfj · 2023-12-13T03:15:10Z

> is there any chance you can bring back support for models using IPAPhonemes?

@anita-smith1

Sorry, it is not in the plan. The major difficulty is that the phonemizer used by IPAPhonemes is hard to port to C++.

As you know, you are training your model in Python, but if you want to deploy it, every part must be converted to C++, including the phonemizer.

All the VITS models from coqui-ai/tts are listed below.

# Graphemes
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--bg--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--cs--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--da--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--et--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--ga--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--es--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--fr--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--nl--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--de--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--hu--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--fi--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--hr--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--lt--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--lv--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--mt--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--pl--mai_female--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--pt--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--ro--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--sk--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--sl--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--sv--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.13.3_models/tts_models--bn--custom--vits_male.zip
# wget https://coqui.gateway.scarf.sh/v0.13.3_models/tts_models--bn--custom--vits_female.zip

# IPAPhonemes
# wget https://coqui.gateway.scarf.sh/v0.7.0_models/tts_models--de--thorsten--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--el--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.10.1_models/tts_models--ca--custom--vits.zip

# VitsCharacters
# wget https://coqui.gateway.scarf.sh/v0.6.1_models/tts_models--it--mai_female--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.6.1_models/tts_models--it--mai_male--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--ewe--openbible--vits.zip # ewe
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--hau--openbible--vits.zip # hausa
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--lin--openbible--vits.zip # lingala
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--tw_akuapem--openbible--vits.zip # akuapem-twi
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--tw_asante--openbible--vits.zip # asante-twi
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--yor--openbible--vits.zip # yoruba

You can see that only 3 of them are using IPAPhonemes.

I suggest that you switch to

"characters_class": "TTS.tts.utils.text.characters.Graphemes",

or

"characters_class": "TTS.tts.models.vits.VitsCharacters",
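
Before converting a coqui model, it may help to check which characters class its config.json declares. The sketch below is a hypothetical helper, not part of sherpa-onnx or coqui-ai/tts: it searches a parsed config for a `characters_class` key wherever it is nested and compares the class name against the two classes named above as supported.

```python
SUPPORTED_CLASS_NAMES = ("Graphemes", "VitsCharacters")


def find_characters_class(config):
    """Recursively look for a "characters_class" key in a parsed config.json."""
    if isinstance(config, dict):
        value = config.get("characters_class")
        if isinstance(value, str):
            return value
        children = config.values()
    elif isinstance(config, list):
        children = config
    else:
        return None
    for child in children:
        found = find_characters_class(child)
        if found is not None:
            return found
    return None


def is_supported(characters_class):
    """True if the class name (last dotted component) is one of the supported ones."""
    return characters_class.rsplit(".", 1)[-1] in SUPPORTED_CLASS_NAMES


# Assumed nesting, for illustration; load a real file with json.load() instead.
config = {"characters": {"characters_class": "TTS.tts.utils.text.characters.IPAPhonemes"}}
cls = find_characters_class(config)
print(cls, "supported" if is_supported(cls) else "not supported")
```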

csukuangfj · 2023-12-13T03:16:31Z

@anita-smith1

> I have noticed that my fine tuned model using IPAPhonemes for non English words (like names of people), has way better quality than the version using VitsCharacter.

You can also use espeak-ng in coqui-ai/tts, though I find that only English VITS models from coqui-ai/tts are using espeak-ng.

aaronnewsome · 2023-12-13T04:21:17Z

> @aaronnewsome
>
> I just wrote a detailed, step-by-step guide about how to convert a piper vits pre-trained model to sherpa-onnx for you. You can find it at https://k2-fsa.github.io/sherpa/onnx/tts/piper.html

Thank you @csukuangfj , I honestly don't think I stumbled across all of these instructions while I was trying to do the conversion for the hours I was trying. It was much easier to do with the instructions you created.

I was able to use the sherpa-onnx-offline-tts example to create a wav with my custom voice trained from scratch. However, the quality was not very good at all. Lots of words had strange pronunciations. The words were pronounced much more accurately by piper.

Also, the JSON file that piper preprocess created for me needed some changes for your script to run. The language key and espeak key didn't look the same as the en_US-amy-medium.onnx.json file I compared it to. In en_US-amy-medium.onnx.json there is:

"espeak": {
    "voice": "en-us"
  }

and

"language": {
    "code": "en_US",
    "family": "en",
    "region": "US",
    "name_native": "English",
    "name_english": "English",
    "country_english": "United States"
  },

The json for my custom voice, trained from scratch only had this for language:

"language": {
        "code": "en"
    },

and also just "en" for espeak voice. This caused your example python script to error, so I adjusted the JSON manually. The JSON file for my onnx was created by piper preprocess, so maybe I used it wrong, which would explain why those fields are wrong/missing. I'll look into it some more.
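
A quick way to see which fields differ is to compare the custom JSON against a known-good one. The sketch below is a hypothetical helper (the name `missing_keys` is made up) that lists keys present in a reference config but absent from a custom one, using the fragments quoted above as data. It does not say which of these keys the conversion script actually requires.

```python
def missing_keys(reference, candidate, prefix=""):
    """List dotted paths of keys found in `reference` but not in `candidate`."""
    missing = []
    for key, value in reference.items():
        path = f"{prefix}{key}"
        if key not in candidate:
            missing.append(path)
        elif isinstance(value, dict) and isinstance(candidate[key], dict):
            missing.extend(missing_keys(value, candidate[key], prefix=path + "."))
    return missing


# Fragments copied from the two JSON files described above.
reference = {
    "espeak": {"voice": "en-us"},
    "language": {
        "code": "en_US",
        "family": "en",
        "region": "US",
        "name_native": "English",
        "name_english": "English",
        "country_english": "United States",
    },
}
custom = {"espeak": {"voice": "en"}, "language": {"code": "en"}}

for path in missing_keys(reference, custom):
    print("missing:", path)
```

Note that this only reports absent keys; differing values such as the espeak voice ("en" versus "en-us") still need to be compared by hand.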

anita-smith1 · 2023-12-13T11:33:37Z

@csukuangfj Please check if my configuration for fine tuning a Vits model using coqui is okay. I am not getting intelligible sound after fine tuning using VitsCharacter, even for English words/phrases. Seems I am doing something wrong:

code = """import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig, CharactersConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsAudioConfig
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
#output_path = os.path.dirname(os.path.abspath(__file__))
##########################################
#Change this to your dataset directory
##########################################
output_path = "/content/drive/MyDrive/"""
code = code + dataset_name + "/" + output_directory + "/" + "\""

code=code + """
dataset_config = BaseDatasetConfig(
##########################################
#Change this to your dataset directory
##########################################
    formatter="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "/content/drive/MyDrive/"""
code = code + dataset_name
code=code + """")

)
audio_config = VitsAudioConfig(
    sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None
)
#i have added character config for sherpa onnx support
character_config = CharactersConfig (
     characters_class="TTS.tts.models.vits.VitsCharacters",
     pad="_",
     eos="",
     bos="",
     characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",
     punctuations=';:,.!?¡¿—…"«»“” ',
     phonemes="ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"
)
config = VitsConfig(
    audio=audio_config,
    characters=character_config,
    run_name="vits_ljspeech_ly",
    batch_size=16,
    eval_batch_size=16,
    batch_group_size=5,
#    num_loader_workers=8,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=100000,
    save_step=1000,
    save_checkpoints=True,
    save_n_checkpoints=4,
    save_best_after=2000,
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    compute_input_seq_cache=True,
    print_step=25,
    print_eval=True,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
    cudnn_benchmark=False,
)
# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves to the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

# INITIALIZE THE TOKENIZER
# Tokenizer is used to convert text to sequences of token IDs.
# config is updated with the default characters if not defined in the config.
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of ```[text, audio_file_path, speaker_name]```
# You can define your custom sample loader returning the list of samples.
# Or define your custom formatter and pass it to the `load_tts_samples`.
# Check `TTS.tts.datasets.load_tts_samples` for more details.
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

# init model
model = Vits(config, ap, tokenizer, speaker_manager=None)

# init the trainer and 🚀
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
"""

I read this and it seems he fixed the issue by setting "use_phonemes=False", but I don't think that applies here.

csukuangfj · 2023-12-13T12:04:12Z

Sorry that I am not familiar with coqui-ai/tts. I suggest that you ask in the repo of coqui-ai/tts.

anita-smith1 · 2023-12-13T12:23:57Z

okay no problem. I am switching from coqui to piper since I'm facing some issues.

anita-smith1 · 2023-12-13T12:45:42Z

I am currently training using "use_phonemes=False" (coqui tts) and it seems to be working so far. If it still doesn't work I will switch completely to piper. Piper has very good documentation.

anita-smith1 · 2023-12-15T13:50:05Z

So I managed to get both coqui tts and piper working but I have decided to stick to piper because the model size is smaller than coqui tts therefore reducing latency. Piper seems to have better pronunciations too.

@csukuangfj I am not sure if you need to update the script in the model zip file.

pip install piper-phonemize onnx onnxruntime==1.16.0 returns:

ERROR: Could not find a version that satisfies the requirement piper-phonemize (from versions: none)
ERROR: No matching distribution found for piper-phonemize

changing the version to 1.16.1 doesn't work either.

so I changed to pip install onnx onnxruntime.

Also, I had to manually change the json file to include:

"language": {
    "code": "en_US",
    "family": "en",
    "region": "US",
    "name_native": "English",
    "name_english": "English",
    "country_english": "United States"
  }

because the original export from piper only had

"language": {
        "code": "en-us"
    }

Without this change, the Python script for exporting to sherpa-onnx fails at:

"language": config["language"]["name_english"],

since there is no "name_english"
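In case it helps anyone else, the manual JSON edit above can be sketched as a small script. This is only a sketch under assumptions: the helper names `ensure_language_fields` and `patch_config_file` are made up for illustration, the default values are simply the en_US block quoted above, and it only fills in keys that are missing (it does not rewrite an existing "code" value).

```python
import json

# Assumed defaults, copied from the en_US "language" block quoted in this thread.
EN_US_LANGUAGE = {
    "code": "en_US",
    "family": "en",
    "region": "US",
    "name_native": "English",
    "name_english": "English",
    "country_english": "United States",
}


def ensure_language_fields(config, defaults=EN_US_LANGUAGE):
    """Return a copy of config whose "language" dict has every key in defaults.

    Keys that already exist are left untouched; only missing ones are added.
    (Hypothetical helper, not part of piper or sherpa-onnx.)
    """
    language = dict(config.get("language") or {})
    for key, value in defaults.items():
        language.setdefault(key, value)
    patched = dict(config)
    patched["language"] = language
    return patched


def patch_config_file(path):
    """Load a model JSON config, fill in missing language fields, write it back."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    patched = ensure_language_fields(config)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(patched, f, ensure_ascii=False, indent=2)
    return patched
```

Calling `patch_config_file` on the model's JSON file before running the export script should make `config["language"]["name_english"]` resolve. For a non-English voice the defaults would need to be replaced with the right values.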

csukuangfj · 2023-12-31T14:49:57Z

Aah, ok, I meant standard TTS-engine API bindings. I may try to do it at some point in the future, to use this TTS as a standard Android TTS engine, for example with screen readers.

@beqabeqa473

I just supported replacing the system TTS engine in k2-fsa/sherpa-onnx#508

You can find a YouTube video at
https://www.youtube.com/watch?v=33QYuVzDORA

nanaghartey · 2024-04-08T06:32:47Z

@csukuangfj when will Sherpa support coqui XTTS-v2 models?

csukuangfj · 2024-04-08T06:36:18Z

XTTS-v2

The model is larger than 1 GB, which requires a GPU, I think.

We won't support it in k2-fsa/sherpa-onnx, which mainly targets embedded environments.

But we may support it in k2-fsa/sherpa, though we cannot say a time when it will be supported.

nanaghartey · 2024-04-25T00:51:22Z

@csukuangfj What about StyleTTS2 models, which have ElevenLabs-like human-sounding quality and PyTorch support? https://github.com/yl4579/StyleTTS2

csukuangfj · 2024-04-25T01:25:33Z

https://github.com/yl4579/StyleTTS2

Does it have onnx export support?

nanaghartey · 2024-04-25T01:31:43Z

https://github.com/yl4579/StyleTTS2

Does it have onnx export support?

Not at the moment

nanaghartey · 2024-04-25T01:39:28Z

@csukuangfj Currently, which model sounds closest to human quality on sherpa-onnx? Coqui or piper TTS models? And are these two the only ones sherpa-onnx supports?

csukuangfj · 2024-04-25T02:05:19Z

Please visit
https://huggingface.co/spaces/k2-fsa/text-to-speech
to try all supported tts models.

There are more than 100 tts models and the best way to find out which model sounds best to you is to try it by yourself.
You don't need to install anything to try it.

csukuangfj · 2024-04-25T02:06:14Z

And are these two the only ones sherpa-onnx supports?

No.

sherpa-onnx currently supports VITS TTS models, and it is not limited to coqui or piper.

nanaghartey · 2024-04-25T09:58:15Z

Please visit

https://huggingface.co/spaces/k2-fsa/text-to-speech

to try all supported tts models.

There are more than 100 tts models and the best way to find out which model sounds best to you is to try it by yourself.

You don't need to install anything to try it.

I tried a couple of them in the past actually. I was hoping you'd have a "top 3" model list. What I noticed with sherpa-onnx is that there's a trade-off between quality and on-device processing compared to cloud solutions out there.
For example, standard coqui tts models sound okay, but once converted to sherpa-onnx the quality and intonation go down. Are there any tips or tricks to get good quality on sherpa-onnx?

csukuangfj · 2024-04-26T05:24:45Z

Example standard coqui tts models sound okay but once converted to sherpa onnx the quality and intonation goes down

Could you describe which model you are using? @nanaghartey

nanaghartey · 2024-04-26T05:35:49Z

Example standard coqui tts models sound okay but once converted to sherpa onnx the quality and intonation goes down

Could you describe which model you are using? @nanaghartey

I'm using my own fine tuned coqui and piper tts vits models. Both sound good before converting to sherpa onnx...but this is the case for the various other English models I tried out

nanaghartey · 2024-07-04T05:28:30Z

@csukuangfj Please take a look at this issue on StyleTTS2 - #117
Since someone has successfully converted it to onnx, can you also convert it so that sherpa-onnx supports it? If this can be achieved, sherpa-onnx will have human-level, high-quality, realistic TTS.

csukuangfj · 2024-07-04T05:40:34Z

We already support Piper. Is there anything special about StyleTTS2? @nanaghartey

nanaghartey · 2024-07-04T05:47:39Z

@csukuangfj Piper can't be compared to StyleTTS2. StyleTTS2 is currently the only open source solution close to proprietary solutions like elevenlabs, open ai's tts, recent gemini voices..
You can compare your k2-fsa quality with an onnx implementation of styleTTS2 here to see the difference

csukuangfj · 2024-07-04T06:16:53Z

What is the model size of StyleTTS2? Does it require GPU?

Could you post the link to the inference script with onnx for StyleTTS2?

nanaghartey · 2024-07-04T06:28:45Z

@csukuangfj The author who converted to onnx has not shared the script yet. I was thinking you'd take a look at the repo and see if it's something you can work on. As you can see from the thread, others are trying to export to onnx

csukuangfj · 2024-07-04T06:35:02Z

I was thinking you'd take a look at the repo and see if it's something you can work on.

Sorry, I don't have extra time to do that. If there are existing ONNX inference scripts, I can take a look.

nanaghartey · 2024-07-04T06:45:57Z

@csukuangfj No problem. I will share the scripts once they're available. Thanks

DavidDohmen · 2024-11-27T19:00:38Z

Is this discussion still related to Rhasspy/Piper or has it drifted to another (impressive) project?
I mainly ask because I'd love to see locally performed TTS natively on Android for the Home Assistant use case.
Many HA users have Android tablets, which are always on and constantly have the Home Assistant Companion app in the foreground. Running as many models/parts of the stack as possible on the device would have many benefits:
faster responses, easier configuration, and more processing power than Raspberry Pis, for example.

What's the status here, and are there any ambitions to work in this direction? I don't have time, but I would help fund this.

csukuangfj · 2024-11-29T06:20:05Z

@DavidDohmen

sherpa-onnx provides runtime support for models from various frameworks, including those from piper.

sherpa-onnx does not provide support to train your models, but piper does that.

Different from piper, sherpa-onnx provides support for various platforms and programming languages.

For instance, you can run piper models with sherpa-onnx on iOS, Android, Linux, Windows, macOS, etc.

Also, sherpa-onnx supports not only text-to-speech, but it also supports speech-to-text, speaker diarization, etc.

DavidDohmen · 2024-12-02T09:30:57Z

Thanks, yes - I understood these differences. Especially the portability to other OSes like Android is super valuable, and my question is basically whether there are considerations to bring the sherpa-onnx functionality into the HA companion Android app.
