online mix noise audio data in training step #2622
Conversation
Update upstream
No Taskcluster jobs started for this pull request. The `allowPullRequests` configuration for this repository (in `.taskcluster.yml` on the default branch) does not allow starting tasks for this pull request.
I tested it with the Freesound Dataset Kaggle 2019, which has about 103 h of noise data.
Did you mean the noise dataset is small, or the Voxforge dataset, comparatively?
I did mean the Voxforge dataset; it has only around 32 h of speech data. I think the RNNoise dataset is smaller than the Freesound one (6 vs. 22 GB; I did not find the length in hours). Also, the RNNoise noise files are in .raw format while Freesound already has .wav format, so you need to convert them to wav somehow first.
Also think about replacing the cache() call with prefetch(tf.data.experimental.AUTOTUNE). For me it reduced the memory usage by about 64 GB without an impact on training speed.
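A minimal sketch of that swap, assuming `dataset` is the `tf.data` pipeline built in `util/feeding.py`:

```python
import tensorflow as tf

# Before: cache() keeps every processed element resident across epochs.
# dataset = dataset.cache()

# After: prefetch() overlaps input preparation with training while holding
# only an autotuned number of upcoming elements in memory.
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
```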
… for memory cost [MOD] deprecate FLAGS.audio_aug_mix_noise_cache
To use
@mychiux413 Any idea how this can be done? Should it be an online process?
You should prepare normalized noise files yourself before training starts. There is no standard way to normalize volume; I can only offer an example, and you can optimize the script yourself. Don't forget to listen to the output audio to make sure everything sounds well. Notice:
usage:
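As a stand-in for the example script, here is only a minimal sketch of such a normalization step, assuming pydub (backed by ffmpeg) is available; the target level, paths, and 16 kHz mono output are illustrative choices, and .raw inputs may need explicit format parameters:

```python
from pathlib import Path
from pydub import AudioSegment

TARGET_DBFS = -25.0  # assumed target loudness; verify by listening to the output

def normalize_to_wav(src: Path, dst: Path, target_dbfs: float = TARGET_DBFS) -> None:
    audio = AudioSegment.from_file(src)                 # decodes wav/mp3/ogg/... via ffmpeg
    audio = audio.apply_gain(target_dbfs - audio.dBFS)  # shift the clip to the target level
    audio.set_frame_rate(16000).set_channels(1).export(dst, format="wav")

for src in Path("noise_raw").rglob("*.*"):
    dst = Path("noise_normalized") / (src.stem + ".wav")
    dst.parent.mkdir(parents=True, exist_ok=True)
    normalize_to_wav(src, dst)
```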
Could you add this script to your pull request? I added a progress bar and a summary to it; feel free to copy it back. The updated code is here: https://github.com/DanBmh/deepspeech-german/blob/master/data/normalize_noise_audio.py
I added the script.
Usage:
@mychiux413 Is there any way we can dump the mixed files and see how effective the mixing of noise into the speech files is? Just to make sure the mixing is proper.
@alokprasad You're right. In fact, all the augmented audio should be reviewable in the pipeline, even augmentations on the spectrogram like pitch/tempo/mask; otherwise we have no basis for tuning the proper parameters.
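One possible way to do that review, sketched under the assumption that `train_ds` is a `tf.data.Dataset` yielding float32 waveforms at 16 kHz (the names are illustrative, not the PR's actual code):

```python
import tensorflow as tf

# Write a handful of augmented examples to disk for listening.
for i, speech in enumerate(train_ds.take(5)):
    wav = tf.audio.encode_wav(tf.expand_dims(speech, -1), sample_rate=16000)
    tf.io.write_file("augmented_{}.wav".format(i), wav)
```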
@mychiux413
But there are two problems I am facing:
Anyway, if I listen to the audio, I don't think noise is being mixed into the speech at all.
@alokprasad I tried tf.print and listened to the audio; it really is augmented. Maybe my default parameters are too conservative (because some noise data are "speech noise", and I don't know what they would cause if too loud). Also, the process will not augment every single audio time step, but just randomly augments an interval of each audio, and many intervals in the noise files are actually silence. Here is another tip: you can also try
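To illustrate that interval behaviour, a small numpy sketch (the gain and the interval choice are simplified stand-ins for the PR's randomized parameters):

```python
import numpy as np

def mix_random_interval(speech: np.ndarray, noise: np.ndarray,
                        gain: float = 0.1) -> np.ndarray:
    """Overlay noise onto one randomly chosen interval of the utterance."""
    rng = np.random.default_rng()
    length = int(rng.integers(1, len(speech) + 1))           # interval length
    start = int(rng.integers(0, len(speech) - length + 1))   # interval start
    out = speech.copy()
    out[start:start + length] += gain * np.resize(noise, length)  # tile/trim noise
    return out
```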
@mychiux413 "process will not augment every single audio time step, but just randomly augment an interval for each audio" I think this might not produce good result , i think each interval should be mixed with noise.( i.e complete file should be mixed with noise) Infact it would be good that same audio is fed twice to the network
I have added an extra flag "noise_flag" to the transcript.csv file, whose value is 0 or 1:
1 means mix noise, 0 means don't mix noise. Relevant code changes:
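Purely as a hypothetical sketch of that idea (not the actual changes), a per-row `noise_flag` column could gate the augmentation inside the `tf.data` pipeline, with `mix_noise` standing in for the PR's mixing function:

```python
import tensorflow as tf

def maybe_mix_noise(audio, noise_flag):
    # Rows flagged 1 get noise mixed in; rows flagged 0 pass through unchanged.
    return tf.cond(tf.equal(noise_flag, 1),
                   lambda: mix_noise(audio),  # assumed mixing function
                   lambda: audio)

dataset = dataset.map(
    lambda audio, noise_flag, transcript:
        (maybe_mix_noise(audio, noise_flag), transcript),
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
```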
Do you mean augmenting with not only one but multiple background speech or noise files at once? If you don't think it's too complicated, this is an interesting idea; it would make the background noises even more realistic. In this case I would suggest making the number not fixed, but random with an upper boundary, to simulate different environments.
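A sketch of that random-count overlay in numpy (the upper boundary and gain are assumed values):

```python
import numpy as np

def mix_multi(speech: np.ndarray, noise_clips: list,
              max_overlays: int = 3, gain: float = 0.1) -> np.ndarray:
    """Overlay a random number (1..max_overlays) of noise clips onto speech."""
    rng = np.random.default_rng()
    mixed = speech.copy()
    for _ in range(int(rng.integers(1, max_overlays + 1))):
        clip = noise_clips[int(rng.integers(len(noise_clips)))]  # random clip
        mixed += gain * np.resize(clip, len(speech))             # tile/trim to fit
    return mixed
```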
Here’s a question: is it necessary to run augmentation on every epoch? It seems like augmentation is probably more valuable as the model nears convergence. I wonder if you could balance out the performance hit by not augmenting the first x epochs, when the model still has a high WER. |
…setest
# Conflicts:
#   DeepSpeech.py
#   evaluate.py
#   util/feeding.py
#   util/flags.py
Here are my recent experiment results (still running); I trained 20 epochs for every model with different parameters.
The results show:
So my conclusion is:
@mychiux413 How are you generating the test samples? Is it natural voice with a noisy background, or have you mixed clean speech with noise and then used that as the test wave?
@alokprasad Mixed clean speech with noise, using the new feature
…g, add option to mix multi noise into one audio [MOD] change FLAGS name, gla iterations is optional
@mychiux413 Master changed quite a bit since you opened this in December. Could you rebase (and squash) it?
@tilmankamp Maybe I should wait for it; the latest master did so much refactoring, the
What's the reason for the last commit (no-sort merge)?
# Conflicts:
#   DeepSpeech.py
#   evaluate.py
#   training/deepspeech_training/util/feeding.py
@mychiux413 Sent you a pull request.
I think @carlfm01 just did an incorrect push at some point. @mychiux413 should be able to just force-push over it.
Merge current master for rebase to v0.7
Am I right that this is now outdated by @tilmankamp's merged pull request #2897? The overlay augmentation docs describe the same mixing features for noise and speech files.
@DanBmh Unfortunately yes. Due to the massive amount of data that we plan to use for overlaying, things had to be integrated more tightly with the sample reading facilities in
Maybe you could have informed us earlier, but I'm glad this feature is now in the master branch :) And you also added some other interesting augmentations. Do you plan to use the noise augmentation for the next checkpoint release already?
Just wanted to note that this PR still has an important feature that is missing in @tilmankamp's overlay implementation: the possibility to run tests with noise mixing.
This needs rebasing anyway, but if someone wants to do it and address the issues, it's welcome.
Mixing noise into the training files before runtime could cause data monotonicity, but mixing noise at runtime can cause very bad performance if we read each noise file from disk to augment each training row (e.g. on an HDD, mixing one audio file takes almost 100 times longer than freq_time_mask does).
To reduce online mixing time, I use another tf.data.Dataset to cache the noise audio arrays, then mix them into the training data.
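A minimal sketch of that caching idea, assuming `load_wav` decodes a file path to a float32 waveform and `train_ds` yields speech waveforms (the fixed gain stands in for the randomized dB suppression described below):

```python
import tensorflow as tf

# Decode every noise file once, cache the arrays, and repeat them endlessly.
noise_ds = (tf.data.Dataset.list_files("noise/*.wav")
            .map(load_wav, num_parallel_calls=tf.data.experimental.AUTOTUNE)
            .cache()
            .shuffle(buffer_size=1024)
            .repeat())

def mix(speech, noise):
    # Trim or zero-pad the noise to the speech length, then add it quietly.
    noise = noise[:tf.shape(speech)[0]]
    noise = tf.pad(noise, [[0, tf.shape(speech)[0] - tf.shape(noise)[0]]])
    return speech + 0.1 * noise

train_ds = (tf.data.Dataset.zip((train_ds, noise_ds))
            .map(mix, num_parallel_calls=tf.data.experimental.AUTOTUNE))
```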
usage:
By default, the speech audio is adjusted with a value between 0~-10 dB, and the noise audio is divided with a value between -25~-50 dB. `--audio_aug_mix_noise_walk_dirs` can set multiple dirs, comma separated.
To manually adjust volume loudness suppression:
- `--audio_aug_mix_noise_max_noise_db -15`, `--audio_aug_mix_noise_min_noise_db -25`
- `--audio_aug_mix_noise_max_noise_db -30`, `--audio_aug_mix_noise_min_noise_db -50`, otherwise the noise voice can have a chance to cover the main speaker's volume.
Use `--audio_aug_mix_noise_cache <your cache path>`, otherwise the noise is cached in memory.
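For illustration, a hypothetical invocation combining these flags (paths and values are placeholders; `--train_files` is the regular training flag):

```
python DeepSpeech.py \
  --train_files clips/train.csv \
  --audio_aug_mix_noise_walk_dirs noise_normalized,freesound_noise \
  --audio_aug_mix_noise_max_noise_db -15 \
  --audio_aug_mix_noise_min_noise_db -25 \
  --audio_aug_mix_noise_cache /tmp/noise_cache
```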