This is a project to develop voice and lip-syncing for text, to see whether we can create a model through which an AI can gain such additional features.
This project uses regular Python, not neural networks or ML.
The final conclusion of this work: "Just speak to your laptop and NOW AI HAS A FACE WITH A VOICE for its output response!"
The project mainly involves three sub-topics:
YOUR_VOICE / TEXT (input) ---> AI model response text ---> lip-syncing + voice ---> AI response with audio + face

I have used freely available PNGs of a man, selected depending on the input text. I use the METAPHONE library of Python to get some level of phonetics, turning input sentences into a meta_word or meta_sentence (in my terminology). Each character is then read and the corresponding images are printed consecutively at high speed, so that the mouth appears to move. 'Hello world' translates to 'helo vorlt'.
'helo' is then displayed as a rapid sequence of per-character mouth PNGs.
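A minimal sketch of the idea, assuming the `metaphone` package (its `doublemetaphone` function) for the phonetic step and per-character PNGs with hypothetical names like `mouth_h.png`; the project's actual meta_word mapping may differ from standard Double Metaphone:

```python
import cv2
from metaphone import doublemetaphone

def to_meta_sentence(text):
    # Convert each word to its primary Double Metaphone code,
    # an approximate phonetic spelling of the word.
    return " ".join(doublemetaphone(word)[0].lower() for word in text.split())

def play_mouth_frames(meta_sentence, delay_ms=80):
    # Flip through one mouth-shape PNG per character; showing them
    # quickly creates the illusion of a moving mouth.
    for ch in meta_sentence:
        if ch == " ":
            frame = cv2.imread("mouth_closed.png")  # rest pose between words
        else:
            frame = cv2.imread(f"mouth_{ch}.png")   # hypothetical file naming
        cv2.imshow("face", frame)
        cv2.waitKey(delay_ms)                       # controls the lip-sync speed

play_mouth_frames(to_meta_sentence("Hello world"))
cv2.destroyAllWindows()
```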
Another attempt has been made to add a fade between image transitions.
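A sketch of such a cross-fade, assuming OpenCV and two same-sized frames; `cv2.addWeighted` blends the outgoing and incoming images over a few intermediate steps:

```python
import cv2

def fade_between(img_a, img_b, steps=5, delay_ms=15):
    # Blend img_a into img_b over `steps` intermediate frames.
    for i in range(1, steps + 1):
        alpha = i / steps
        blended = cv2.addWeighted(img_a, 1.0 - alpha, img_b, alpha, 0)
        cv2.imshow("face", blended)
        cv2.waitKey(delay_ms)
```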
Three different voice libraries, pyttsx3, gTTS, and Whisper AI (from OpenAI), are used to produce voice simultaneously with the text. The voice and the image projection run independently, so for each model the voice speed has to be adjusted manually; the appropriate settings are written in #comments in the code. You choose the model, and from the pre-set data the image video and audio are output simultaneously, creating the effect of a speaking man.
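As a rough sketch of how the two independent streams can be started together (pyttsx3 shown here, reusing the helpers from the sketch above; the rate value is exactly the kind of per-model constant recorded in the #comments and must be tuned by ear):

```python
import threading
import pyttsx3

def speak(text, rate=150):
    # pyttsx3 runs its own audio loop; `rate` (words per minute)
    # is tuned manually so the voice roughly matches the frames.
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)
    engine.say(text)
    engine.runAndWait()

text = "Hello world"
# Voice and image projection run independently: the voice plays in a
# background thread while the mouth frames flip in the main thread.
voice = threading.Thread(target=speak, args=(text,))
voice.start()
play_mouth_frames(to_meta_sentence(text))  # from the sketch above
voice.join()
```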
Text without punctuation has proved to give almost 90% perfect voice and image-video fluency. Text with punctuation may sometimes create discrepancies due to the uneven voice output of the models used. AI models can now use the model built above and have a face with a voice! Just feed your question (or prompt) to your AI (preferably a text-generation model) and it will produce the required output with a voice and a face mouthing it! As this project develops, you may one day see ChatGPT or other AIs having a face and a voice. One example model built and pushed is a Sentiment Analysis AI.
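Gluing a text-generation model onto the front of the pipeline could look like the sketch below; `sentiment_model` stands in for whatever model you choose (e.g. the Sentiment Analysis AI mentioned above), and the helper functions reuse the hypothetical sketches above:

```python
def ai_with_face(prompt, model):
    response = model(prompt)                  # any text-generation model
    meta = to_meta_sentence(response)         # phonetic form for the frames
    voice = threading.Thread(target=speak, args=(response,))
    voice.start()                             # voice runs independently...
    play_mouth_frames(meta)                   # ...while the mouth animates
    voice.join()

# e.g. ai_with_face("I loved this movie!", sentiment_model)
```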
- The audio libraries used are variable and hard to control, since each runs independently. Punctuation especially causes uneven pauses that break the flow, and some words take an exceptionally long time to speak. For example, with the prefix 'un', pyttsx3 says 'unnnnbiased' or 'unnnable', while my images run through the u-n-a-b-l-e PNGs at their own pace.
- Since voice and image projection run independently, they sometimes drift mid-sentence when a word is spoken differently from its meta_word, and for long texts they may go fully out of sync.
- This is why I have not been able to properly hook it up to gpt-3.5-turbo or gpt-4: the text they produce contains a good quantity of punctuation, and to maintain the flow of the voice you cannot simply strip the punctuation out and expect it to speak perfectly.
- Whisper AI in particular is very hard to use because it has an unevenly long pause (varying with sentence length) before it starts outputting any voice, so it is not a very promising model to use.