#! https://zhuanlan.zhihu.com/p/517420568
Wave-U-Net is an end-to-end learning method for audio source separation that operates directly in the time domain, permitting the integrated modelling of phase information and allowing large temporal contexts to be taken into account. We find that, compared to the original system designed for singing voice separation in music, a reduced number of hidden layers is sufficient for speech enhancement.
It is possible that the advantage stems from the upsampling scheme, which avoids aliasing; this should be investigated further.
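As a minimal numpy sketch of the idea (not the paper's implementation), upsampling by linear interpolation inserts midpoints between existing samples instead of zero-stuffing or using strided transposed convolutions, so it introduces no high-frequency artifacts:

```python
import numpy as np

def linear_upsample(x):
    """Upsample a 1-D feature sequence by interleaving each pair of
    adjacent samples with their midpoint (linear interpolation).
    This smooth upsampling is one hypothesis for why Wave-U-Net
    avoids the aliasing artifacts of transposed convolutions."""
    mids = (x[:-1] + x[1:]) / 2.0            # midpoints of adjacent pairs
    out = np.empty(2 * len(x) - 1, dtype=x.dtype)
    out[0::2] = x                            # original samples
    out[1::2] = mids                         # interpolated samples
    return out

x = np.array([0.0, 1.0, 0.0, -1.0])
print(linear_upsample(x))  # 4 samples become 7: 0, 0.5, 1, 0.5, 0, -0.5, -1
```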
The results indicate that there is room for increasing effectiveness and efficiency by further adapting the model size and other parameters, e.g. filter sizes, to the task, and by expanding to multi-channel audio and multi-source separation.
Audio source separation refers to the problem of extracting one or more target sources while suppressing interfering sources and noise. Two related tasks are those of speech enhancement and singing voice separation, both of which involve extracting the human voice as a target source.
Time-domain models: WaveNet, SEGAN
WaveNet has a non-causal conditional input and predicts samples in parallel for each output; it is based on the repeated application of dilated convolutions with exponentially increasing dilation factors to take context information into account.
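A small numpy sketch (an illustration, not the WaveNet code) shows why exponentially increasing dilation factors matter: each layer spaces its kernel taps further apart, so the receptive field grows exponentially with depth while the parameter count grows only linearly:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """'Valid' 1-D convolution with kernel taps spaced `dilation`
    samples apart, enlarging the receptive field at no extra cost."""
    k = len(kernel)
    span = (k - 1) * dilation  # input samples covered by one application
    return np.array([
        sum(kernel[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span)
    ])

def receptive_field(kernel_size, num_layers):
    """Receptive field of a stack of layers with dilations 1, 2, 4, ..."""
    return 1 + (kernel_size - 1) * sum(2 ** l for l in range(num_layers))

print(receptive_field(2, 10))  # 1024 samples from ten kernel-size-2 layers
```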
SEGAN employs a neural network in the time-domain with an encoder and decoder pathway that successively halves and doubles the resolution of feature maps in each layer, respectively, and features skip connections between encoder and decoder layers.
The overall architecture is a one-dimensional U-Net: a series of downsampling blocks followed by a series of upsampling blocks, which together produce the predictions.
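The down/up structure above can be sketched in a few lines of numpy (a toy shape walkthrough under assumed depth and input length, not the actual model): downsampling discards every other time step, upsampling restores resolution by linear interpolation, and each upsampling stage combines its output with the matching saved features, as in a U-Net skip connection.

```python
import numpy as np

def decimate(x):
    """Downsampling block (features only): keep every other time step."""
    return x[::2]

def upsample(x):
    """Upsampling block (features only): linear interpolation."""
    out = np.empty(2 * len(x) - 1, dtype=x.dtype)
    out[0::2] = x
    out[1::2] = (x[:-1] + x[1:]) / 2.0
    return out

# Hypothetical input length chosen so all stages align cleanly.
x = np.sin(np.linspace(0, 8 * np.pi, 17))

skips = []
h = x
for _ in range(3):          # 3 downsampling blocks (toy depth)
    skips.append(h)         # save features for the skip connection
    h = decimate(h)         # halve the time resolution: 17 -> 9 -> 5 -> 3

for skip in reversed(skips):
    h = upsample(h)         # restore resolution: 3 -> 5 -> 9 -> 17
    h = h + skip[:len(h)]   # crop-and-combine, as in a U-Net skip

print(len(h), len(x))       # output resolution matches the input
```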
In applying the Wave-U-Net architecture to speech enhancement, our objective is to separate a mixture waveform