Reading notes on speech and deep learning related papers, covering Automatic Speech Recognition (ASR), Speech Enhancement and Dereverberation (SED), Speech Separation (SS), Sound Source Localization (SSL), and other speech signal processing topics.
- [Overview] Deep Learning for Audio Signal Processing [note]
- Continuous Speech Separation with Conformer [note]
- Distortion-Controlled Training for End-to-End Reverberant Speech Separation with Auxiliary Autoencoding Loss [note]
- Dual-Path Filter Network: Speaker-Aware Modeling for Speech Separation [note]
- DPTNet: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation [note]
- DPRNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation [note]
- Interrupted and Cascaded Permutation Invariant Training for Speech Separation [note]
- Unified Gradient Reweighting for Model Biasing with Applications to Source Separation [note]
- Mining Hard Samples Locally And Globally For Improved Speech Separation [note]
- On The Compensation Between Magnitude and Phase in Speech Separation [note]
- On the Use of Deep Mask Estimation Module for Neural Source Separation Systems [note]
- Recursive speech separation for unknown number of speakers [note]
- Rethinking the Separation Layers in Speech Separation Networks [note]
- SpEx: Multi-Scale Time Domain Speaker Extraction Network [note]
- Sudo rm -rf: Efficient Networks for Universal Audio Source Separation [note]
- SFSRNet: Super-Resolution for Single-Channel Audio Source Separation [note]
- Universal Speaker Extraction in the Presence and Absence of Target Speakers for Speech of One and Two Talkers [note]
- Unsupervised Sound Separation Using Mixture Invariant Training [note]
- Voice Separation with an Unknown Number of Multiple Speakers [note]
- VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking [note]
- VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [note]
- Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation [note]
- TasNet [note]
- A comprehensive study of speech separation: spectrogram vs waveform separation [note]
- An End-to-End Deep Learning Framework For Multiple Audio Source Separation And Localization [note]
- Beam-TasNet: Time-domain Audio Separation Network Meets Frequency-domain Beamformer [note]
- Beam-Guided TasNet: An Iterative Speech Separation Framework with Multi-Channel Output [note]
- Channel-Attention Dense U-Net for Multichannel Speech Enhancement [note]
- Neural Spatial Filter: Target Speaker Speech Separation Assisted with Directional Information [note]
- Complex neural spatial filter: Enhancing multi-channel target speech separation in complex domain [note]
- Distance-Based Sound Separation [note]
- Embedding and Beamforming: All-neural Causal Beamformer for Multichannel Speech Enhancement [note]
- End-to-End Multi-Channel Speech Separation [note]
- Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation [note]
- FaSNet: Low-latency Adaptive Beamforming for Multi-microphone Audio Processing [note]
- [FaSNet-TAC] End-to-End Microphone Permutation and Number Invariant Multi-channel Speech Separation [note]
- Multi-Channel Overlapped Speech Recognition with Location Guided Speech Extraction Network [note]
- ADL-MVDR: All deep learning MVDR beamformer for target speech separation [note]
- Generalized Spatio-Temporal RNN Beamformer for Target Speech Separation [note]
- MIMO Self-attentive RNN Beamformer for Multi-speaker Speech Separation [note]
- Implicit Neural Spatial Filtering for Multichannel Source Separation in the Waveform Domain [note]
- Improving Speaker Discrimination of Target Speech Extraction With Time-Domain Speakerbeam [note]
- Inter-channel Conv-TasNet for multichannel speech enhancement [note]
- Localization Based Sequential Grouping for Continuous Speech Separation [note]
- Location-based training for multi-channel talker-independent speaker separation [note]
- Multi-Microphone Speaker Separation based on Deep DOA Estimation [note]
- Multi-band PIT and Model Integration for Improved Multi-channel Speech Separation [note]
- One model to enhance them all: array geometry agnostic multi-channel personalized speech enhancement [note]
- Real-time binaural speech separation with preserved spatial cues [note]
- Separating Varying Numbers of Sources with Auxiliary Autoencoding Loss [note]
- Spatial Loss for Unsupervised Multi-channel Source Separation [note]
- The Cone of Silence: Speech Separation by Localization [note]
- Binaural Speech Separation of Moving Speakers With Preserved Spatial Cues [note]
- Online Binaural Speech Separation Of Moving Speakers With A Wavesplit Network [note]
- FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement [note]
- Deep Neural Mel-Subband Beamformer for In-car Speech Separation [note]
- A two-stage U-Net for high-fidelity denoising of historical recordings [note]
- DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement [note]
- Deep Filtering: Signal Extraction and Reconstruction Using Complex Time-Frequency Filters [note]
- Differentiable Consistency Constraints for Improved Deep Speech Enhancement [note]
- Funnel DCU for Phase-Aware Speech Enhancement [note]
- Improved Speech Enhancement with the Wave-U-Net [note]
- Interactive Speech and Noise Modeling for Speech Enhancement [note]
- Phase-aware Speech Enhancement with Deep Complex U-Net [note]
- Real Time Speech Enhancement in the Waveform Domain [note]
- SE-Conformer: Time-Domain Speech Enhancement using Conformer [note]
- Selector-Enhancer: Learning Dynamic Selection of Local and Non-local Attention Operation for Speech Enhancement [note]
- Time-Frequency Masking in the Complex Domain for Speech Dereverberation and Denoising [note]
- Uformer: A U-Net Based Dilated Complex & Real Dual-Path Conformer Network [note]
- A Survey of Sound Source Localization with Deep Learning Methods, The Journal of the Acoustical Society of America, 2022 [paper] [note]
- SLoClas: A Database for Joint Sound Localization and Classification, 2021 [paper] [note]
- A Time-domain Unsupervised Learning Based Sound Source Localization Method [note]
- Adaptation of Multiple Sound Source Localization Neural Networks with Weak Supervision and Domain-Adversarial Training [note]
- Broadband DOA Estimation using CNN trained with noise signals [note]
- CRNN-Based Multiple DoA Estimation Using Acoustic Intensity Features for Ambisonics Recordings [note]
- Deep Neural Network for Multiple Speaker Detection and Localization [note]
- Deep Learning Based Two-dimensional Speaker Localization With Large Ad-hoc Microphone Arrays [note]
- DOA estimation for multiple sound sources using CRNN [note]
- End-to-end Binaural Sound Localisation from the Raw Waveform [note]
- Localization, detection and tracking of multiple moving sound sources with a convolutional recurrent neural network [note]
- Multi-speaker DOA estimation using deep CNN trained with noise signals [note]
- Multi-speaker localization using CNN trained with noise [note]
- Multi-task Neural Network for Robust Multiple Speaker Embedding Extraction [note]
- Neural Network Adaptation and Data Augmentation for Multi-Speaker Direction-of-Arrival Estimation [note]
- Robust DOA Estimation Based on Convolutional Neural Network and Time-Frequency Masking [note]
- Robust Source Counting and DOA Estimation Using Spatial Pseudo-Spectrum and Convolutional Neural Network [note]
- Robust TDOA Estimation Based on Time-Frequency Masking and Deep Neural Networks [note]
- Sound Event Localization and Detection of Overlapping Sources Using CRNN [note]
- TDOA estimation using DNN with TF Mask [note]
- A Real-Time Speaker Diarization System Based on Spatial Spectrum [note]
- Deep Learning based Multi-Source Localization with Source Splitting and its Effectiveness in Multi-Talker Speech Recognition [note]
- Determining Number of Speakers from Single Microphone Speech Signals by Multi-Label CNN [note]
- High-Resolution Speaker Counting in Reverberant Rooms Using CRNN with Ambisonics Features [note]
- Real-time Speaker counting in cocktail party scenario using Attention-guided CNN [note]
- Conformer: Convolution-augmented Transformer for Speech Recognition [note]
- Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models [note]