Feel free to post any issues or questions about Cognitive TTS service here! #128

Open
szhaomsft opened this issue Aug 21, 2019 · 8 comments
@szhaomsft
Collaborator

We encourage developers to post issues and questions in this forum.

It is monitored regularly.

@szhaomsft
Collaborator Author

Posts like the following are welcome:

  1. Questions about the API.
  2. Feedback about the service.
  3. Feature requests.
  4. Sample requests.

@phly95

phly95 commented Sep 9, 2019

It seems the output is limited to 10 minutes of audio (at least when using the neural option). What if I want to process a long text file, like a required reading or a chapter of a book?
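
One possible workaround, sketched here on the assumption that the limit applies per request: split the text at sentence boundaries, synthesize each chunk in its own request, and concatenate the resulting audio. The helper name `split_into_chunks` and the 3000-character budget are illustrative choices, not values taken from the service documentation.

```python
import re


def split_into_chunks(text: str, max_chars: int = 3000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking after sentences.

    max_chars is an assumed budget, not a documented service limit.
    A single sentence longer than max_chars is kept whole.
    """
    # Break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            # Still within budget, or the chunk is empty and must take the sentence.
            current = candidate
        else:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then be wrapped in its own SSML document and sent separately. The right chunk size depends on the voice and speaking rate, so it is worth measuring the audio duration per chunk before settling on a value.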

@bodyzatva

bodyzatva commented Sep 10, 2019

I'm using the voice pt-PT-HeliaRUS with language pt_PT for a chatbot in a client project.
We are facing issues when the bot speaks email addresses.
When I send this text using ssmlSpeak:

"O seu e-mail é <say-as interpret-as="characters">[email protected]"

The email is not spelled out.

I tried other voices and languages, such as pt-BR-HeloisaRUS (pt_BR) and en-US-JessaRUS (en-US).

Only the voice "en-US-Jessa24kRUS" spells out the name.

Can you tell me why?

By the way, we have a workaround that forces spelling, which is to separate the characters of the email text with spaces:
"O seu e-mail é <say-as interpret-as="characters">a n a r e b e l o @ s a p o . p t</say-as>"

Is this a problem with the pt-PT language? And how can I get better results when spelling out emails?
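
The spacing workaround described above can be automated. The following is a minimal sketch; `spell_out_ssml` is a hypothetical helper name, and whether a given voice honors `interpret-as="characters"` without the extra spaces is not something this thread confirms.

```python
from xml.sax.saxutils import escape


def spell_out_ssml(text: str) -> str:
    """Wrap text in a say-as element with a space between every character."""
    # Drop existing spaces first so the output has exactly one space per gap.
    spaced = " ".join(text.replace(" ", ""))
    # escape() protects '&', '<' and '>' so the result stays valid XML.
    return f'<say-as interpret-as="characters">{escape(spaced)}</say-as>'


# Usage with a placeholder address (not a real one):
fragment = "O seu e-mail é " + spell_out_ssml("name@example.com")
```

The fragment still has to be placed inside the usual `<speak>` and `<voice>` elements before it is sent.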

@shoutbomb

How do I control the pace of the generated speech? I need to slow it down by 10%.
(en-US, JessaNeural)
X-Microsoft-OutputFormat: riff-24khz-16bit-mono-pcm
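
SSML provides a `prosody` element whose `rate` attribute accepts a relative percentage, so one way to approach this is to wrap the text in `<prosody rate="-10%">`. The sketch below only builds the SSML string. The function name `slow_down_ssml` is hypothetical, and the full voice name `en-US-JessaNeural` is an assumption based on the "(en-US, JessaNeural)" shorthand above.

```python
from xml.sax.saxutils import escape


def slow_down_ssml(text: str, voice: str, percent: int = 10, lang: str = "en-US") -> str:
    """Build an SSML document that speaks text `percent` percent slower."""
    rate = f"-{percent}%"  # relative change, e.g. "-10%"
    return (
        f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="{lang}">'
        f'<voice name="{voice}">'
        f'<prosody rate="{rate}">{escape(text)}</prosody>'
        "</voice></speak>"
    )


# Assumed full voice name; adjust to the voice actually deployed.
ssml = slow_down_ssml("Your books are due on Friday.", voice="en-US-JessaNeural")
```

The resulting string would be sent as the request body with the same `X-Microsoft-OutputFormat` header as before.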

@hannabonert

hannabonert commented Nov 24, 2021

Hello,
I followed the sample from here, and can successfully send 16 kHz audio to the service, and receive a valid response.
How can I use a sampling rate of 44100 Hz?

I have tried both of the below, but I get "InitialSilenceTimeout" for every recording that I try.
connection.setRequestProperty("Content-Type", "audio/raw;encoding=signed-integer;bits=16;rate=44100");
connection.setRequestProperty("Content-Type", "audio/wav; codecs=audio/pcm; samplerate=44100");

As a test, I ran the same recording through at 16 kHz, and got a "RecognitionStatus" of "Success". I then resampled it to 44100 Hz, and I got "InitialSilenceTimeout".

I have an example using the Speech SDK that works, but now I need to use 44.1 kHz audio data with the REST API.

Any advice would be greatly appreciated.

Thank you!
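
If the REST endpoint turns out to accept only 16 kHz input, which the "InitialSilenceTimeout" results suggest but this thread does not confirm, one fallback is to downsample the 44.1 kHz samples before sending them. The following is a rough sketch using linear interpolation on a list of 16-bit PCM sample values. `resample_linear` is a hypothetical helper, and it applies no low-pass filter, so a proper resampling library would give better quality.

```python
def resample_linear(samples: list[int], src_rate: int, dst_rate: int) -> list[int]:
    """Resample mono PCM samples from src_rate to dst_rate by linear interpolation.

    No anti-aliasing filter is applied, so this is only a rough approximation.
    """
    if not samples:
        return []
    out_len = len(samples) * dst_rate // src_rate
    out: list[int] = []
    for i in range(out_len):
        # Position of output sample i on the input time axis.
        pos = i * src_rate / dst_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    return out


# One second of 44.1 kHz audio becomes one second of 16 kHz audio.
one_second = resample_linear([0] * 44100, 44100, 16000)
```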

@isabirahmed

isabirahmed commented Nov 20, 2022

Text to Speech does not set the right pitch if two pitches are set in one request.

Sample SSML:

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="en-US-SaraNeural">Much against his will Reddy obeyed. <prosody rate="default" pitch="28%" volume="default">“It isn’t the least bit of use,”</prosody> he grumbled, as he trotted towards the Big River. <prosody rate="default" pitch="28%" volume="default">“There won’t be anything there. It is just a waste of time.”</prosody></voice></speak>

Two parts of the sentence are set to pitch=28%.
The first part, "It isn’t the least bit of use," sounds more like pitch=8% even though it's set to 28%.
The second part, "There won’t be anything there. It is just a waste of time." sounds correct at pitch=28%.

Please note this happens with all the voices and looks like a major bug.
It only happens when the pitch is set on more than one part of the text.

Please test this in the US East region.
Sample audio file: https://fliki.ai/share/audio/microsoft-pitch-issue-637b4b26dde64016ddbd2a51

@gchiarapa

I'm using the REST API to synthesize text to speech, but I'd like to know how to play the response. Any ideas on how to convert and play the response?

My request:

    uri = 'https://brazilsouth.tts.speech.microsoft.com/cognitiveservices/v1';
    method = 'POST';
    $http({
        "method": method,
        "url": uri,
        "headers": {
            "Content-type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "audio-16khz-64kbitrate-mono-mp3",
            "Host": "brazilsouth.tts.speech.microsoft.com",
            "User-Agent": "",
            "Authorization": "Bearer " + res.data
        },
        data: ''
    });
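
With the `audio-16khz-64kbitrate-mono-mp3` output format, the response body is raw MP3 data, so it needs to be handled as bytes rather than text. The sketch below shows the same request in Python and writes the bytes to a file that any MP3 player can open. The helper names `build_tts_request` and `save_audio` and the `User-Agent` value are hypothetical. In the AngularJS call above, the equivalent step would probably be requesting a binary response (for example `responseType: 'arraybuffer'`) before handing the data to an audio element, though that is untested here.

```python
import urllib.request


def build_tts_request(region: str, token: str, ssml: str) -> urllib.request.Request:
    """Build the POST request for the TTS endpoint without sending it."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "audio-16khz-64kbitrate-mono-mp3",
        "Authorization": "Bearer " + token,
        "User-Agent": "tts-sample",  # hypothetical client name
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


def save_audio(audio: bytes, path: str) -> None:
    """Write the binary response body to disk unchanged."""
    with open(path, "wb") as f:
        f.write(audio)


# Sending the request needs a valid token and network access:
# with urllib.request.urlopen(build_tts_request("brazilsouth", token, ssml)) as resp:
#     save_audio(resp.read(), "speech.mp3")
```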

@Chukarslan

This pertains to commit: d457a6d
GPT Streaming response based on TTS text splitter for each sentence.

Is it possible to share the Python / Node version of the code?

Thanks!
