ONNX streaming support #255
Conversation
@synesthesiam Do you feel positive about merging this? Best …
@synesthesiam …
Just dropping by to say I love this! I've written my own C++ inference server, and this is a major issue I've run into.
@mush42 How do I get …
I don't fully understand everything in this pull request, but I have a feeling that this approach could be used to implement word tracking, since the sub-sentence phonemes can be synthesized in chunks. It would be cool if the stream API were available through a …
@eeejay …
The Python torch library used to stream the real-time format Piper voices is large, and our device has limited storage available. Are there any plans to modify the main piper executable to support streaming these real-time format Piper voices?
Thanks for this @marty1885! These look great! |
@marty1885 …
@mush42 That's already done. The gap you hear comes from the WebSocket JavaScript thread not being real-time. I actually implemented a similarity-based search to find the optimal point to concatenate the audio. I think it works even better than a simple fade in and out.
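As a rough sketch of that idea (not the actual Paroli implementation; the `window` and `overlap` sizes below are illustrative), a similarity-based join might look like this:

```python
# Illustrative sketch of a similarity-based join between two audio chunks.
# NOT the actual Paroli code; window/overlap sizes are made-up parameters.
import numpy as np

def best_join_offset(prev_tail: np.ndarray, next_head: np.ndarray, window: int = 256) -> int:
    """Find the offset in prev_tail whose `window` samples best match the
    start of next_head, using normalized correlation as the similarity."""
    target = next_head[:window]
    target = target / (np.linalg.norm(target) + 1e-8)
    best_off, best_score = 0, -np.inf
    for off in range(len(prev_tail) - window + 1):
        seg = prev_tail[off:off + window]
        score = float(np.dot(seg / (np.linalg.norm(seg) + 1e-8), target))
        if score > best_score:
            best_off, best_score = off, score
    return best_off

def join_chunks(prev: np.ndarray, nxt: np.ndarray, overlap: int = 1024) -> np.ndarray:
    """Concatenate at the most similar point instead of using a fixed crossfade."""
    off = best_join_offset(prev[-overlap:], nxt)
    cut = len(prev) - overlap + off  # drop the tail samples past the match point
    return np.concatenate([prev[:cut], nxt])
```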
@marty1885 OK I understand. |
@marty1885 …
Would there be value (or is it even feasible) to merge Paroli into Piper? I thought this would be easier than merging Sonata into Piper since Paroli is written in C++. Just an idea from someone who would love to learn a lot more about Paroli, Piper and Sonata (so take with a grain of salt...) |
@synesthesiam What do you think? The major changes I made to Piper are abstracting the ONNX inference code to allow RKNN (and potentially other accelerators) as a backend, plus some API changes to properly support low-latency streaming. The main reason I forked is the additional dependencies (drogon, libopusenc, soxr) that the Piper core doesn't need.
I see a status that states 'This pull request was closed', but GitHub doesn't show much information about when this happened. Was it done recently? Is there any information on who closed the pull request, or why? Now that I've posted this comment, I see that 'This pull request was closed' always appears below my comment, so my guess is that the pull request was closed quite a while ago.
@jaredhagel2 this PR has been merged into Piper as an example of how to implement streaming support. Not implemented in the C++ app though. |
Oh I see. Thank you for the clarification. |
Link to issue number:
Issue #25
Summary of the issue:
Piper uses sentence-level streaming.
For short sentences, the latency of Piper output is relatively low thanks to its good real-time factor (RTF). For longer sentences, however, latency is prohibitively high, which hinders real-time applications such as screen readers. For example, at an RTF of 0.5, a 10-second sentence takes about 5 seconds of compute before any audio can be played when streaming at the sentence level.
Description of how this pull request fixes the issue:
This PR implements streaming output by splitting the VITS model into two parts: encoder and decoder.
First, the encoder output is generated for the whole utterance at once; then it is split into chunks of frames using the given chunk size and fed to the decoder chunk by chunk.
To maintain speech quality, each chunk fed to the decoder is padded with some frames from the previous and next chunks, and the corresponding wave frames are then removed from the final audio output (see the sketch below).
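As a minimal sketch of this scheme (the tensor names `input`/`z` and the chunk, pad, and hop sizes below are assumptions; the real values depend on the exported model):

```python
# Sketch of whole-utterance encoding followed by padded, chunked decoding.
# Tensor names and sizes are illustrative, not the exact PR code.
import numpy as np
import onnxruntime  # encoder/decoder are assumed to be InferenceSession objects

CHUNK = 45               # latent frames decoded per step (illustrative)
PAD = 5                  # context frames borrowed from neighbouring chunks
SAMPLES_PER_FRAME = 256  # decoder upsampling factor (model-dependent)

def stream_decode(encoder, decoder, phoneme_ids: np.ndarray):
    """Yield audio chunk by chunk: one whole-utterance encoder pass,
    then overlapped decoder passes with the padded audio trimmed off."""
    # 1) Run the encoder once over the full utterance.
    z = encoder.run(None, {"input": phoneme_ids})[0]  # (1, channels, n_frames)
    n_frames = z.shape[-1]
    # 2) Decode overlapping frame windows, trimming audio from the padding.
    for start in range(0, n_frames, CHUNK):
        end = min(start + CHUNK, n_frames)
        lpad = min(PAD, start)
        rpad = min(PAD, n_frames - end)
        window = z[..., start - lpad : end + rpad]
        audio = decoder.run(None, {"z": window})[0].squeeze()
        # Drop the samples produced by the padding frames on either side.
        stop = len(audio) - rpad * SAMPLES_PER_FRAME if rpad else len(audio)
        yield audio[lpad * SAMPLES_PER_FRAME : stop]
```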
To export a checkpoint, use the command:
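(Assuming the streaming export is exposed as a `piper_train.export_onnx_streaming` module; the module name and paths here are placeholders.)

```sh
python3 -m piper_train.export_onnx_streaming /path/to/checkpoint.ckpt /path/to/output_dir
```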
For inference, use the command:
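(Similarly assuming an `infer_onnx_streaming` module that takes the two exported models; the module and flag names are placeholders.)

```sh
python3 -m piper_train.infer_onnx_streaming \
  --encoder /path/to/encoder.onnx \
  --decoder /path/to/decoder.onnx \
  < input.txt
```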
This pipes wave bytes to `stdout`; you can then redirect the output to any wave-playing program, for example:
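(Assuming the output is raw 16-bit mono PCM at 22,050 Hz; the actual sample rate depends on the voice.)

```sh
python3 -m piper_train.infer_onnx_streaming \
  --encoder /path/to/encoder.onnx \
  --decoder /path/to/decoder.onnx \
  < input.txt | aplay -r 22050 -f S16_LE -c 1 -t raw
```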
Testing performed:
Tested export and inference using the hfc-male checkpoint.
Known issues with pull request:
The encoder has many components that could be moved into the decoder to further reduce latency, but including those components in the decoder impacts naturalness. There is a trade-off between encoder inference speed (latency) and the naturalness of the generated speech.
For instance, the `flow` component can be included in either the encoder or the decoder. When included in the encoder, it adds significant latency there. Conversely, chunking the input to the `flow` component (as part of the decoder) may impact speech quality (not verified). We need to empirically determine which components can be made streamable, and which ones should generate their output all at once.