✏️ This note records our project and the course materials from the NCTU course Introduction to Artificial Intelligence in fall 2021 (first semester of academic year 110) 🏫
Copyright © of the project belongs to our project members: Chang, Yu-Jen; Chen, Chih-Cheng; Chiang, Yung-chu. No one may download, reprint, reproduce, distribute, publicly broadcast, or publish the research content of our project in any form without written consent.
🏆 Award : Best voice application in the AI course.
When it comes to a voice faker, it dawns on us that we often use apps for instant speech recognition and speech-to-text conversion in class or while listening to talks. We are interested in how audio such as the human voice is recognized, so we took this project as an opportunity to understand the underlying logic better and to build a simple model that recognizes a few words. This research uses a public speech dataset, the Common Voice dataset. In the preprocessing step we use PyTorch's torchaudio package to turn the frequency content of each WAV file into an image, then train a deep learning model on these images so that it can recognize some simple words, and improve the accuracy by tuning the parameters.
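To make the "frequency into a picture" step more concrete, the sketch below shows the Hz-to-mel mapping that sits behind a mel spectrogram. It is only an illustration: it assumes the common HTK-style formula `m = 2595 * log10(1 + f / 700)`, and the function names `hz_to_mel`, `mel_to_hz`, and `mel_band_edges` are hypothetical helpers, not part of our project code or of torchaudio.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_mels: int) -> List[float]:
    """Return n_mels + 2 band edges in Hz, equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_mels + 2)]


if __name__ == "__main__":
    # Bands are narrow at low frequencies and wide at high frequencies.
    for edge in mel_band_edges(0.0, 8000.0, 10):
        print(round(edge, 1))
```

Equal steps on the mel axis become increasingly wide steps in Hz, which is why a mel spectrogram gives more resolution to the low frequencies where speech carries most of its information.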
-
Goals: We expect to take speech from one person as input and output the same sentence spoken in another person's voice. If everything goes well, we can input a string, choose which voice we want to use, and have the AI fake our voices. :arrow_right: This helps people who are too lazy to speak and people with speech impairments.
-
Description: This project is interesting because deepfakes are popular, and a voice faker can be combined with them. Ideally, we could then fake someone completely.
-
Topic-related course content: 1. Deep learning 2. Convolutional neural network (CNN) 3. Recurrent neural network (RNN) 4. Classification 5. Feature map
-
Flow chart of our project
graph LR
A(Voice Faker Project) -->B(Melspectrogram classification)--> C(MFCC audio to text)--> D(text audio map)--> E(text to audio)--> F(Done)
- Related prior work :
- Using AI to fake sounds: full paper link ◆ Deep4SNet is a text-independent classifier of original/fake speech recordings. ◆ It is based on a customized deep learning architecture. ◆ Speech recordings are transformed into histograms to feed the model. ◆ Experiments are performed on the Deep Voice and Imitation datasets. ◆ The accuracy of the classifier is over 98%.
- The article applied by Deep Voice: full paper link ◆ Demonstrates and analyzes the strength of speaker adaptation approaches for voice cloning, based on fine-tuning a pre-trained multi-speaker model. ◆ Proposes a novel speaker encoding approach, which provides comparable naturalness and similarity in subjective evaluations while requiring significantly less cloning time and computational resources. ◆ Proposes automated evaluation methods for voice cloning based on neural speaker classification and speaker verification. ◆ Demonstrates voice morphing for gender and accent transformation via embedding manipulations.
- Possible problems: 1. How to process the tones from the same person. 2. How to raise the precision.
- Plan:
  - 11/22: Problem design (1/2): read and discuss four articles to learn the techniques behind Deep Voice.
  - 11/29: Problem design (2/2): read and discuss two papers, and try to come up with further interesting problems based on these papers or use them as benchmarks.
  - 12/6: Method and experiment design (1/3): sound recognition.
  - 12/13: Method and experiment design (2/3): sound classification.
  - 12/20: Method and experiment design (3/3): sound faking.
  - 12/27: End: review the whole work and compose the final project.
Our theme is Voice Faker, which can help people who are too lazy to speak or who have difficulty with pronunciation to generate speech in their own voice. Since the last project proposal, we have followed the original plan. Before this week, we finished reading four tutorial articles on Deep Voice technology and two related research papers, and completed the first stage of experimental design, sound recognition. We are now using Colab for the second stage, classification, and the model reaches about 70% accuracy. Sound recognition will use the Common Voice dataset together with the CTC and beam search algorithms to train on Chinese and English, but this training has not been done yet. The biggest problem we face at present is that we are still working out how to apply classification to actual audio files for feature block analysis. That concludes our project proposal update.
-
Our procedure:
-
(1) What sound is and how it is digitized. Which problems audio deep learning solves in our daily lives. What spectrograms are and why they are all-important.
-
(2) Done: Why mel spectrograms perform better and how to generate them. :arrow_right: The classification model we built in Colab has an accuracy of about 70%.
-
(3) Enhance spectrogram features for optimal performance through hyper-parameter tuning and data augmentation, using MFCC (Mel Frequency Cepstral Coefficients) instead of the mel spectrogram.
-
(4) Speech-to-Text algorithm and architecture, using CTC (Connectionist temporal classification) Loss and Decoding for aligning sequences.
-
(5) Difficult: the beam search algorithm, commonly used by speech-to-text and NLP applications to enhance predictions.
-
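Step (4) above depends on CTC decoding, which can be sketched roughly as follows. This is a minimal greedy (best-path) decoder, not the code from our project: it picks the most likely label in each frame, merges consecutive repeats, and drops the blank symbol. The blank index `0` and the function name `ctc_greedy_decode` are assumptions made for this example.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(
    frame_probs: Sequence[Sequence[float]], blank: int = BLANK
) -> List[int]:
    """Greedy CTC decoding: argmax per frame, merge repeats, remove blanks."""
    # Most likely label index for every time frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]

    decoded: List[int] = []
    prev = None
    for label in best_path:
        # Keep a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded


if __name__ == "__main__":
    # Three classes: 0 = blank, 1 and 2 = characters (toy values).
    frames = [
        [0.1, 0.8, 0.1],
        [0.2, 0.7, 0.1],
        [0.9, 0.05, 0.05],
        [0.1, 0.8, 0.1],
        [0.1, 0.2, 0.7],
        [0.1, 0.1, 0.8],
        [0.9, 0.05, 0.05],
    ]
    print(ctc_greedy_decode(frames))
```

The blank between the two runs of label 1 is what lets CTC output the same character twice in a row; without it the repeats would be merged into one.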
We proceeded according to the original plan. Note: voice recognition will use the Common Voice dataset together with the CTC and beam search algorithms for training, but the training has not been completed yet. ★ The biggest problem we face now is that we are still working out how to apply classification to actual audio files for feature block analysis.
We finished the project as planned above and were awarded the "best voice application" honor by the professor. All files, including the formal LaTeX paper, the presentation slides, and the code of our project, are available in this repository.
Copyright © The homework and teaching materials are provided by the professor of the course, Nick Wang, and the answers in some code blocks were added by me.
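As a rough illustration of the beam search idea mentioned above, here is a small generic sketch. It is a plain beam search over a toy next-token model, not the CTC prefix beam search used in real speech-to-text decoders and not our project code; `beam_search`, `toy_model`, and the token names are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]
EOS = "<eos>"


def beam_search(
    next_log_probs: Callable[[Prefix], Dict[str, float]],
    beam_width: int,
    max_len: int,
) -> List[Tuple[Prefix, float]]:
    """Keep the beam_width best partial sequences at every step."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(max_len):
        candidates: List[Tuple[Prefix, float]] = []
        for prefix, score in beams:
            if prefix and prefix[-1] == EOS:
                candidates.append((prefix, score))  # finished, carry over
                continue
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Sort by total log-probability and prune to the beam width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(p and p[-1] == EOS for p, _ in beams):
            break
    return beams


def toy_model(prefix: Prefix) -> Dict[str, float]:
    """Toy distribution where the greedy first choice is not the best overall."""
    if prefix == ():
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix == ("a",):
        return {"x": math.log(0.5), "y": math.log(0.5)}
    if prefix == ("b",):
        return {"x": math.log(0.9), "y": math.log(0.1)}
    return {EOS: 0.0}


if __name__ == "__main__":
    print(beam_search(toy_model, beam_width=1, max_len=3)[0])  # greedy
    print(beam_search(toy_model, beam_width=2, max_len=3)[0])  # wider beam
```

With a beam width of 1 the search commits to the locally best first token, while a width of 2 keeps the alternative alive long enough to find a sequence with a higher total probability.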
Homework Description | link |
---|---|
Lab1: Intro to AI - Python tutorial | 🔗 |
Lab2: AI agent - nbgrader & search | 🔗 |
Lab3: AI gym - taxi problem | 🔗 |
Lab4: Multilayer Perceptron (MLP) learning | 🔗 |
Lab5: Convolutional neural network & pytorch | 🔗 |
Lab6: ros docker - msg & package & unittest | 🔗 |
Lab7: Transfer learning | 🔗 |
Lab8: Detection & segment - clutter maskrcnn | 🔗 |
Lab9: Reinforcement learning - Deep Q learning | 🔗 |
Lab10: Deterministic policy gradient - DDPG & RDPG | 🔗 |
Bonus: Slides and notes of the class 💯
- Watch MORE of my projects ➜ My GitHub repositories