ONNX streaming support #255
Conversation
@synesthesiam Do you feel positive about merging this? Best …
@synesthesiam …
Just dropping by to say I love this! I've written my own C++ inference server, and this is a major issue I've run into.
@mush42 How do I get …
I don't fully understand everything in this pull request, but I have a feeling that this approach could be used to implement word tracking, since the sub-sentence phonemes can be synthesized in chunks. It would be cool if the stream API were available through a …
@eeejay …
The Python torch library used to stream the real-time format Piper voices is large, and our device has limited storage available. Are there any plans to modify the main piper executable to support streaming these real-time format Piper voices?
Thanks for this @marty1885! These look great! |
@marty1885 …
@mush42 That's already done. The gap you hear comes from the WebSocket JavaScript thread not being real-time. I actually implemented a similarity-based search to find the optimal point to concatenate the audio. I think it works even better than a simple fade in and out.
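As a rough sketch of that idea (not the actual Paroli implementation; the `window` and `overlap` sizes below are illustrative), a similarity-based join might look like this:

```python
# Illustrative sketch of a similarity-based join between two audio chunks.
# NOT the actual Paroli code; window/overlap sizes are made-up parameters.
import numpy as np

def best_join_offset(prev_tail: np.ndarray, next_head: np.ndarray, window: int = 256) -> int:
    """Find the offset in prev_tail whose `window` samples best match the
    start of next_head, using normalized correlation as the similarity."""
    target = next_head[:window]
    target = target / (np.linalg.norm(target) + 1e-8)
    best_off, best_score = 0, -np.inf
    for off in range(len(prev_tail) - window + 1):
        seg = prev_tail[off:off + window]
        score = float(np.dot(seg / (np.linalg.norm(seg) + 1e-8), target))
        if score > best_score:
            best_off, best_score = off, score
    return best_off

def join_chunks(prev: np.ndarray, nxt: np.ndarray, overlap: int = 1024) -> np.ndarray:
    """Concatenate at the most similar point instead of using a fixed crossfade."""
    off = best_join_offset(prev[-overlap:], nxt)
    cut = len(prev) - overlap + off  # drop the tail samples past the match point
    return np.concatenate([prev[:cut], nxt])
```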
@marty1885 OK I understand. |
@marty1885 …
Would there be value (or is it even feasible) to merge Paroli into Piper? I thought this would be easier than merging Sonata into Piper since Paroli is written in C++. Just an idea from someone who would love to learn a lot more about Paroli, Piper and Sonata (so take with a grain of salt...) |
@synesthesiam What do you think? The major changes I made to Piper are abstracting the ONNX inference code to allow RKNN (and potentially other accelerators) as a backend, plus some API changes to properly support low-latency streaming. The main reason I forked is the additional dependencies (drogon, libopusenc, soxr) that the Piper core doesn't need.
I see a status that states 'This pull request was closed', but GitHub doesn't show much information about when this happened. Was it done recently? Is there any information on who closed the pull request, or why? Now that I've posted this comment, I see that 'This pull request was closed' always appears below my comment, so my guess is that the pull request was closed quite a while ago.
@jaredhagel2 this PR has been merged into Piper as an example of how to implement streaming support. Not implemented in the C++ app though. |
Oh I see. Thank you for the clarification. |
Link to issue number:
Issue #25
Summary of the issue:
Piper uses sentence-level streaming.
For short sentences, the latency of Piper output is relatively low thanks to its good real-time factor (RTF). For longer sentences, however, latency is prohibitively high, which hinders real-time applications such as screen readers. For example, at an RTF of 0.5, a 10-second sentence takes about 5 seconds of compute before any audio can be played when streaming at the sentence level.
Description of how this pull request fixes the issue:
This PR implements streaming output by splitting the VITS model into two parts: encoder and decoder.
First, the encoder output is generated for the whole utterance at once; then it is split into chunks of frames using the given chunk size and fed to the decoder chunk by chunk.
To maintain speech quality, each chunk fed to the decoder is padded with some frames from the previous and next chunks, and the corresponding wave frames are then removed from the final audio output (see the sketch below).
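As a minimal sketch of this scheme (the tensor names `input`/`z` and the chunk, pad, and hop sizes below are assumptions; the real values depend on the exported model):

```python
# Sketch of whole-utterance encoding followed by padded, chunked decoding.
# Tensor names and sizes are illustrative, not the exact PR code.
import numpy as np
import onnxruntime  # encoder/decoder are assumed to be InferenceSession objects

CHUNK = 45               # latent frames decoded per step (illustrative)
PAD = 5                  # context frames borrowed from neighbouring chunks
SAMPLES_PER_FRAME = 256  # decoder upsampling factor (model-dependent)

def stream_decode(encoder, decoder, phoneme_ids: np.ndarray):
    """Yield audio chunk by chunk: one whole-utterance encoder pass,
    then overlapped decoder passes with the padded audio trimmed off."""
    # 1) Run the encoder once over the full utterance.
    z = encoder.run(None, {"input": phoneme_ids})[0]  # (1, channels, n_frames)
    n_frames = z.shape[-1]
    # 2) Decode overlapping frame windows, trimming audio from the padding.
    for start in range(0, n_frames, CHUNK):
        end = min(start + CHUNK, n_frames)
        lpad = min(PAD, start)
        rpad = min(PAD, n_frames - end)
        window = z[..., start - lpad : end + rpad]
        audio = decoder.run(None, {"z": window})[0].squeeze()
        # Drop the samples produced by the padding frames on either side.
        stop = len(audio) - rpad * SAMPLES_PER_FRAME if rpad else len(audio)
        yield audio[lpad * SAMPLES_PER_FRAME : stop]
```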
To export a checkpoint, use the command:
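(Assuming the streaming export is exposed as a `piper_train.export_onnx_streaming` module; the module name and paths here are placeholders.)

```sh
python3 -m piper_train.export_onnx_streaming /path/to/checkpoint.ckpt /path/to/output_dir
```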
For inference, use the command:
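(Similarly assuming an `infer_onnx_streaming` module that takes the two exported models; the module and flag names are placeholders.)

```sh
python3 -m piper_train.infer_onnx_streaming \
  --encoder /path/to/encoder.onnx \
  --decoder /path/to/decoder.onnx \
  < input.txt
```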
This pipes wave bytes to `stdout`; you can then redirect the output to any wave-playing program, for example:
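(Assuming the output is raw 16-bit mono PCM at 22,050 Hz; the actual sample rate depends on the voice.)

```sh
python3 -m piper_train.infer_onnx_streaming \
  --encoder /path/to/encoder.onnx \
  --decoder /path/to/decoder.onnx \
  < input.txt | aplay -r 22050 -f S16_LE -c 1 -t raw
```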
Testing performed:
Tested export and inference using the hfc-male checkpoint.
Known issues with pull request:
The encoder has many components that could be moved into the decoder to further reduce latency, but including those components in the decoder impacts naturalness. There is a trade-off between encoder inference speed (latency) and the naturalness of the generated speech.
For instance, the `flow` component can be included in either the encoder or the decoder. When included in the encoder, it adds significant latency there. Conversely, chunking the input to the `flow` component (as part of the decoder) may impact speech quality (not verified). We need to empirically determine which components can be made streamable, and which ones should generate their output all at once.