Realtime/Fastest way to generate stable voice audio locally #882

lev-laptinov · 2025-02-12T12:49:30Z

lev-laptinov
Feb 12, 2025

I want to play generated audio as fast as it possible from my pod, i'm using runpod.io where i run docker image with github repo with start comand
python tools/api_server.py --llama-checkpoint-path checkpoints/fish-speech-1.5 --decoder-checkpoint-path checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth --listen 0.0.0.0:8080 --compile

Then i run the request to this pod like:

headers = {
    'accept': '*/*',
    'Content-Type': 'application/json',
}

json_data = {
    'text': '''text''',
    'chunk_length': 200,
    'format': 'wav',
    'references': [{
            'audio': aud_1,
            'text': text_1
        },
        {
            'audio': aud_2,
            'text': text_2
        }
    ],
    'reference_id': None,
    'seed': 42,
    'use_memory_cache': 'on',
    'normalize': True,
    'streaming': False,
    'max_new_tokens': 1024,
    'top_p': 0.7,
    'repetition_penalty': 1.2,
    'temperature': 0.7,
}

As i understand there is no other option to get stable voice apart from use reference audio

Abour streaming as i understand it returns generated chunks as it are generated:
I've tried to use it using:

p = pyaudio.PyAudio()
audio_format = pyaudio.paInt16  # Assuming 16-bit PCM format
stream = p.open(
    format=audio_format, channels=1, rate=44100, output=True
)

wf = wave.open("output.wav", "wb")
wf.setnchannels(1)
wf.setsampwidth(p.get_sample_size(audio_format))
wf.setframerate(44100)
stream_stopped_flag = False
try:
    start_write = time.time()
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            
            stream.write(chunk)
            wf.writeframesraw(chunk)
        else:
            if not stream_stopped_flag:
                stream.stop_stream()
                stream_stopped_flag = True
finally:
    
    stream.close()
    p.terminate()
    wf.close()

but i didn't get any difference, it starts playing the same time as the whole audio is written

I also used use_memory_cache, it gives increase in speed

I have also tried to finetune it, i increased the time from 6 to 4 secs for audio ~13sec

Now when i run it on 2xRTX 4090 with --compile i get smth ~10sec audio per ~3sec

So like main enhancement as i think is streaming, is it possible to stream audio?
Also maybe i use or understand smth wrong?

hanzalaareeb · 2025-02-12T15:34:11Z

hanzalaareeb
Feb 12, 2025

Can you specify which API endpoint you are using? Because http://LINK/v1/tts has "Content-Type": "application/msgpack". Also, is the inference speed you mentioned measured after generating audio from a chunk or before?

1 reply

lev-laptinov Feb 13, 2025
Author

yeap, i'm using v1/tts,

there is also application/json

About speed i meant from the sending of request until the audio is saved on my machine

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Realtime/Fastest way to generate stable voice audio locally #882

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 1 reply

{{title}}

{{title}}

Select a reply

Realtime/Fastest way to generate stable voice audio locally #882

lev-laptinov Feb 12, 2025

Replies: 1 comment · 1 reply

hanzalaareeb Feb 12, 2025

lev-laptinov Feb 13, 2025 Author

lev-laptinov
Feb 12, 2025

Replies: 1 comment 1 reply

hanzalaareeb
Feb 12, 2025

lev-laptinov Feb 13, 2025
Author