AllTalk update Standard & Streaming generation #98

Merged 5 commits on Dec 14, 2024

Conversation

erew123
Contributor

@erew123 erew123 commented Dec 13, 2024

Update AllTalk Integration

  • Added Standard Generation mode as an alternative to Streaming
  • Integrated RVC (voice conversion) support with voice selection
  • Added RVC pitch adjustment (-24 to +24)
  • RVC controls automatically disable when using Streaming mode
  • Standard generation mode set as default
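As a rough illustration of the last three points, the sketch below clamps the pitch to the -24 to +24 range and switches the RVC settings off in Streaming mode. The function name, the mode strings, and the returned field names are hypothetical and are not taken from the PR's `index.html`.

```javascript
// Hypothetical helper: names and shapes are assumptions for illustration only.
const RVC_PITCH_MIN = -24;
const RVC_PITCH_MAX = 24;

function buildRvcSettings(mode, voice, pitch) {
    // RVC is only offered in Standard mode; Streaming disables the controls.
    const streaming = mode === "streaming";
    // Treat a missing or non-numeric pitch as 0, then clamp to the allowed range.
    const numericPitch = Number(pitch) || 0;
    const clamped = Math.max(RVC_PITCH_MIN, Math.min(RVC_PITCH_MAX, numericPitch));
    return {
        enabled: !streaming && Boolean(voice),
        voice: streaming ? null : (voice || null),
        pitch: streaming ? 0 : clamped,
    };
}
```

Called as `buildRvcSettings("standard", "some_voice.pth", 30)`, this would yield a pitch of 24 with RVC enabled; the same call with `"streaming"` returns RVC disabled.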

This now opens up Kobold to all the TTS engines that AllTalk supports, as well as the RVC/Voice2voice pipeline. For example, you can use Piper TTS, which is very light on GPU/CPU RAM and other resources, then use the RVC/Voice2voice pipeline to make the TTS output sound like any RVC-based voice you want. Full details, along with a link to 100,000+ voices, are here: https://github.com/erew123/alltalk_tts/wiki/RVC-(Retrieval%E2%80%90based-Voice-Conversion)

Streaming generation only works with the Coqui XTTS engine.

I saw two changes that differed between the "lite" dev branch and my index.html, so I matched those changes in a second commit to my fork. In other words, the changes below now match your dev branch updates exactly, and I noted this on commit erew123@c7c02a4. The image below shows what was incorrect/not matching, which I changed back to what it should be. (God, I hope that makes sense.)

image

@LostRuins
Owner

LostRuins commented Dec 14, 2024

Hi @erew123

Looking through your PR now. Seems like a lot has changed - at the moment it doesn't work but it's probably due to the ad-hoc audio element you created. But before I look through that, I wanna check some things

Previously, AllTalk returned raw audio data from the generate endpoint:

    fetch(localsettings.saved_alltalk_url + alltalk_gen_endpoint, {
        method: 'POST',
        body: formData, // send payload as FormData
    })
    .then(response => response.arrayBuffer())
    .then(data => {
        return audioContext.decodeAudioData(data); // this is raw data
    })

Is that no longer an option? I notice you now return a URL to a remote resource on the server:
"output_file_path": "/content/alltalk_tts/outputs/audiofile_173413884301c0b.wav",

  • To me that seems rather risky, since the API is no longer stateless. How long does that .wav stay valid for? Will the download expire? If it doesn't, won't that clutter AllTalk with old wav files from previous generations?
  • You used the exact same API endpoint, does that mean you are breaking backwards compatibility with all clients using prior versions of AllTalk? This also means that once Lite is updated, users running old versions locally will suddenly see it break.
  • Also seems like a potential privacy risk. Anyone with the url can download anyone else's audio. And the files will potentially persist on server side

I'm wondering if you might instead consider a param to the generation request like stateless: true where either

  • it returns the raw wav data in the same request instead
  • it returns a JSON payload containing the audio as a base64 encoded representation

so a second download is not needed? Happy to discuss further.
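To make the second option concrete, here is a minimal client-side sketch of decoding inline base64 audio. It assumes a hypothetical response field `audio_base64`; neither that field nor the `stateless` flag is confirmed to exist in AllTalk.

```javascript
// Hypothetical response shape: { status: "success", audio_base64: "<base64 wav bytes>" }.
// The `audio_base64` field name is an assumption made for this sketch.
function decodeBase64Audio(payload) {
    if (!payload || typeof payload.audio_base64 !== "string") {
        throw new Error("response does not contain inline audio");
    }
    const binary = atob(payload.audio_base64); // base64 -> binary string
    const bytes = new Uint8Array(binary.length);
    for (let i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i);
    }
    // Return an ArrayBuffer, which is what audioContext.decodeAudioData() accepts.
    return bytes.buffer;
}
```

The resulting buffer could be passed to `audioContext.decodeAudioData()` just like the raw response, so no second download would be required.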

@LostRuins
Owner

I got the non-streaming mode working correctly, and now handle responses from both v1 and v2 based on returned content type.
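A minimal sketch of what dispatching on the returned content type could look like. This is not the merged code; the function name and the returned labels are assumptions for illustration.

```javascript
// Hypothetical helper: classify a response by its Content-Type header value.
function classifyResponse(contentType) {
    // Drop parameters such as "; charset=utf-8" and normalise case.
    const type = (contentType || "").split(";")[0].trim().toLowerCase();
    if (type === "application/json") {
        return "json";  // JSON body that points at a generated file
    }
    if (type.startsWith("audio/")) {
        return "audio"; // raw audio bytes in the response body
    }
    return "unknown";
}

// Possible use inside a fetch chain (sketch only):
//   const kind = classifyResponse(response.headers.get("content-type"));
//   if (kind === "json") { /* read response.json(), then fetch the file */ }
//   else if (kind === "audio") { /* read response.arrayBuffer() directly */ }
```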

However, streaming is still not working. I could not get even the official Colab example to stream, and I'm seeing a lot of error 502 responses. Is there a simple setup I can do to get the streaming demo working?

@henk717
Collaborator

henk717 commented Dec 14, 2024

In XTTS API server, streaming is also not possible remotely, as it then plays the audio through the server's local speakers.

@LostRuins
Owner

LostRuins commented Dec 14, 2024

It's XTTS powered but the server is different, so it might work? Not sure yet.

Additionally, once it crashes it won't start again:

ERROR:    Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 693, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/content/alltalk_tts/tts_server.py", line 201, in startup_shutdown
    await model_engine.setup()
  File "/content/alltalk_tts/system/tts_engines/vits/model_engine.py", line 170, in setup
    self.available_models = self.scan_models_folder()
  File "/content/alltalk_tts/system/tts_engines/vits/model_engine.py", line 319, in scan_models_folder
    print(f"[{self.branding}ENG] \033[91mWarning\033[0m: Model folder '{model_name}' is missing required")
UnboundLocalError: local variable 'model_name' referenced before assignment

ERROR:    Application startup failed. Exiting.

@LostRuins LostRuins added the help wanted Extra attention is needed label Dec 14, 2024
@LostRuins
Owner

Alright I got the streaming endpoint working - but it seems like it doesn't really make much difference - the audio still needs to be fully generated on the backend before the streaming even starts.

Anyway, I will add the streaming endpoint in but buffer it synchronously first.
Please review the final version and I will merge this. This should correctly support AllTalk v1 and v2.
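For illustration, buffering a streamed response fully before playback might look like the sketch below. It is not the merged implementation, and the helper names are hypothetical. In practice `response.arrayBuffer()` does the same buffering in one call; the manual reader loop is shown only to make the step explicit.

```javascript
// Hypothetical helpers: collect every chunk of a stream, then join them.
function concatChunks(chunks) {
    const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
    const joined = new Uint8Array(total);
    let offset = 0;
    for (const chunk of chunks) {
        joined.set(chunk, offset); // copy this chunk at the current position
        offset += chunk.length;
    }
    return joined;
}

async function bufferEntireStream(stream) {
    const reader = stream.getReader();
    const chunks = [];
    // Keep reading until the stream reports it is finished.
    while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        chunks.push(value);
    }
    return concatChunks(chunks); // playback starts only after this resolves
}
```

With a fetch response, `bufferEntireStream(response.body)` would resolve to the complete audio bytes once the server has finished sending.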

@erew123
Contributor Author

erew123 commented Dec 14, 2024

Hi @LostRuins I had to travel 100+ miles last night due to a break-in which I am dealing with and hence a little caught up.

That aside, let me do what I can to answer questions.

The streaming setup should have worked for both AllTalk V1 and V2... at least it did when I was testing locally. I can re-test in a bit. Streaming can be funny on Colab at times, which I think has something to do with Cloudflare tunnels, and as mentioned, streaming only works with the XTTS engine. Also, some browsers (e.g. Firefox) are incapable of dealing with the PCM stream, so they will not work when getting a streaming response. Brave and Chrome should be fine. There is a basic HTML test page available on the AllTalk API address.

Both the Streaming and Standard AllTalk protocols, with code examples, are listed here: https://github.com/erew123/alltalk_tts/wiki#-api-documentation

As for how long WAVs will remain on the AllTalk server, that depends on the duration someone sets for auto-deleting older audio outputs:

image

Default deletion is Disabled.

The difference between the AllTalk V1 and V2 protocols is that, on a Standard response, V2 does not return the protocol/IP/port in the response:

- **AllTalk v2 API (Recommended):**
    - Returns relative file paths only
    - More flexible for different deployments
    - Example response: 
        ```json
        {
            "output_file": "/outputs/tts_output.wav",
            "status": "success"
        }
        ```
    - Best for:
        - New integrations
        - Modern web applications
        - Flexible deployment environments
        - Container-based systems

- **AllTalk v1 API (Legacy):**
    - Returns complete URLs with protocol and IP
    - Maintains older integration support
    - Example response:
        ```json
        {
            "output_file": "http://127.0.0.1:7851/outputs/tts_output.wav",
            "status": "success"
        }
        ```
    - Best for:
        - Existing integrations
        - Systems requiring full URLs
        - Direct file access needs
        - Backward compatibility
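A client that wants to accept both response styles could resolve `output_file` against the server address it already knows. The sketch below relies on the standard `URL` constructor, which leaves an absolute URL untouched and resolves a relative path against the base. The function name and the `serverBase` parameter are hypothetical.

```javascript
// Hypothetical helper: turn a v1 (absolute) or v2 (relative) output_file
// value into a full URL the client can fetch.
function resolveOutputUrl(outputFile, serverBase) {
    // new URL(input, base) ignores `base` when `input` is already absolute.
    return new URL(outputFile, serverBase).href;
}

// v2-style relative path:
//   resolveOutputUrl("/outputs/tts_output.wav", "http://127.0.0.1:7851")
// v1-style full URL (base is ignored):
//   resolveOutputUrl("http://127.0.0.1:7851/outputs/tts_output.wav", "http://127.0.0.1:7851")
```

Both calls above produce `http://127.0.0.1:7851/outputs/tts_output.wav`.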

I will get back to you at some point, but I currently have to go deal with police, window companies, insurance etc... :/ so I will take another look ASAP/when I am free to do so. (Sorry)

@LostRuins
Owner

oof take care, no rush on this. i'll merge first, we can always review again as needed.

@LostRuins LostRuins merged commit 6e5f22f into LostRuins:dev Dec 14, 2024
Labels
enhancement New feature or request help wanted Extra attention is needed