first commit

MoonInTheRiver · Mar 1, 2022 · 683a779
commit 683a779
Showing 41 changed files with 5,700 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,5 @@
+.idea
+*.pyc
+__pycache__/
+*.sh
+local_tools/dtw
diff --git a/README.md b/README.md
@@ -0,0 +1,49 @@
+# Learning the Beauty in Songs: Neural Singing Voice Beautifier
+Jinglin Liu, Chengxi Li, Yi Ren, Zhiying Zhu, Zhou Zhao
+
+Zhejiang University
+
+ACL 2022 Main conference
+
+---
+[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2202.13277)
+[![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/NeuralSVB)](https://github.com/MoonInTheRiver/NeuralSVB)
+![visitors](https://visitor-badge.glitch.me/badge?page_id=moonintheriver/NeuralSVB)
+
+<a href="https://neuralsvb.github.io" target="_blank">Project&nbsp;Page</a>
+
+<p align="center">:construction: :pick: :hammer_and_wrench: :construction_worker:</p>
+
+This repository is the official PyTorch implementation of our ACL 2022 [paper](https://arxiv.org/abs/2202.13277). For now, we release the code for the `SADTW` algorithm described in the paper. The full code and data will be released by the ACL 2022 conference (before June 2022). Please star the repository and stay tuned!
+
+```
+.
+|--modules
+    |--voice_conversion
+        |--dtw
+            |--enhance_sadtw.py  (Our algorithm)
+|--tasks
+    |--singing
+        |--pitch_alignment_task.py  (Usage example)
+```
+
+
+:rocket: **News**: 
+ - Feb.24, 2022: Our new work, NeuralSVB, was accepted by ACL-2022. [Demo Page](https://neuralsvb.github.io).
+ - Dec.01, 2021: Our recent work `DiffSinger` was accepted by AAAI-2022. [![](https://img.shields.io/github/stars/MoonInTheRiver/DiffSinger)](https://github.com/MoonInTheRiver/DiffSinger) [![downloads](https://img.shields.io/github/downloads/MoonInTheRiver/DiffSinger/total.svg)](https://github.com/MoonInTheRiver/DiffSinger/releases) \| [![](https://img.shields.io/github/stars/NATSpeech/NATSpeech)](https://github.com/NATSpeech/NATSpeech).
+ - Sep.29, 2021: Our recent work `PortaSpeech` was accepted by NeurIPS-2021. [![](https://img.shields.io/github/stars/NATSpeech/NATSpeech)](https://github.com/NATSpeech/NATSpeech). 
+ - May.06, 2021: We submitted DiffSinger to arXiv [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446).
+
+
+## Abstract
+
+We are interested in a novel task, singing voice beautifying (SVB). Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice while keeping the content and vocal timbre. Current automatic pitch correction techniques are immature, and most of them are restricted to intonation and ignore the overall aesthetic quality. Hence, we introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task, which adopts a conditional variational autoencoder as the backbone and learns the latent representations of vocal tone. In NSVB, we propose a novel time-warping approach for pitch correction, Shape-Aware Dynamic Time Warping (SADTW), which improves the robustness of existing time-warping approaches and synchronizes the amateur recording with the template pitch curve. Furthermore, we propose a latent-mapping algorithm in the latent space to convert the amateur vocal tone to the professional one. Extensive experiments on both Chinese and English songs demonstrate the effectiveness of our methods in terms of both objective and subjective metrics.
+
+<img align="center" src="resources/model_all7.png" style="  display: block;
+  margin-left: auto;
+  margin-right: auto;
+  width: 100%;" />
+<img align="center" src="resources/melhhh2.png" style="  display: block;
+  margin-left: auto;
+  margin-right: auto;
+  width: 100%;" />
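
The abstract describes SADTW as a more robust variant of existing time-warping approaches for aligning an amateur recording with a template pitch curve. As background only, the sketch below shows plain dynamic time warping (DTW) on two 1-D pitch-like sequences. It is not the SADTW algorithm from `enhance_sadtw.py`; the function name `dtw_path` and the absolute-difference frame cost are assumptions made for illustration.

```python
import math


def dtw_path(x, y):
    """Classic DTW between two 1-D sequences (illustrative, not SADTW).

    Returns the total alignment cost and the warping path as a list of
    (index_in_x, index_in_y) pairs.
    """
    n, m = len(x), len(y)
    # cost[i][j] = best cost of aligning x[:i] with y[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])  # assumed frame-level distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to the start, always taking the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda ij: cost[ij[0]][ij[1]])
        path.append((i - 1, j - 1))
    return cost[n][m], path[::-1]


if __name__ == "__main__":
    # Toy pitch curves: the second one holds the middle note one frame longer.
    total, path = dtw_path([1, 2, 3], [1, 2, 2, 3])
    print(total, path)
```

Plain DTW matches frames by value only, which is the kind of baseline the abstract says SADTW makes more robust.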
diff --git a/egs/datasets/audio/PopBuTFy/base_text2mel.yaml b/egs/datasets/audio/PopBuTFy/base_text2mel.yaml
@@ -0,0 +1,30 @@
+base_config:
+  - egs/egs_bases/tts/base_zh.yaml
+  - egs/egs_bases/singing/base.yaml
+raw_data_dir: 'data/raw/popbutfy_short_male_0.75'
+processed_data_dir: 'data/processed/popbutfy_0.75'
+binary_data_dir: 'data/binary/popbutfy_0.75'
+
+# binarization parameters
+num_spk: 100
+binarization_args:
+  with_spk_id: true
+  reset_phone_dict: true
+  reset_word_dict: true
+  with_spk_embed: false
+  with_wav: false
+  with_linear: false
+  with_f0cwt: false
+word_size: 1000
+
+use_spk_embed: false
+use_spk_id: false
+use_ref_enc: false
+use_tech: true
+num_techs: 3
+
+normalize_pitch: false
+
+# vocoder parameters
+vocoder: pwg
+vocoder_ckpt: ''
diff --git a/egs/datasets/audio/PopBuTFy/vae_global.yaml b/egs/datasets/audio/PopBuTFy/vae_global.yaml
@@ -0,0 +1,51 @@
+base_config:
+  - egs/egs_bases/vc/vc_ppg.yaml
+  - base_text2mel.yaml
+
+binary_data_dir: 'data/binary/popbutfy_para_unseen_multispkemb'
+#binary_data_dir: 'data/binary/popcs_songs'
+
+task_cls: tasks.singing.svc_vae_task.SVCVAEGlobalTask
+use_energy: false
+
+# origin configs
+#lambda_mel_adv: 0.01  #
+max_tokens: 20000
+max_frames: 5000
+
+# vae parameters
+concurrent_ways: ''
+lambda_kl: 0.001
+phase_1_steps: -1
+phase_2_steps: 100000
+max_updates: 200000
+phase_1_concurrent_ways: 'p2p'
+phase_2_concurrent_ways: 'a2a,p2p'
+phase_3_concurrent_ways: 'a2p'
+cross_way_no_recon_loss: false
+cross_way_no_disc_loss: false
+disable_map: false
+
+latent_size: 128
+fvae_enc_dec_hidden: 192
+fvae_kernel_size: 5
+fvae_enc_n_layers: 8
+fvae_dec_n_layers: 4
+
+frames_multiple: 4
+
+# map parameters
+map_lr: 0.001
+map_scheduler_params:
+  gamma: 0.5
+  step_size: 60000
+
+
+
+# vocoder parameters
+vocoder: hifigan
+vocoder_ckpt: 'checkpoints/1012_hifigan_all_songs_nsf'
+
+# asr parameters
+pretrain_asr_ckpt: 'checkpoints/1009_pretrain_asr'
+
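
The `map_lr`, `gamma`, and `step_size` values in `vae_global.yaml` look like the parameters of a step-decay learning-rate schedule. The scheduler code is not part of the files shown here, so the sketch below rests on an assumption: that the rate is multiplied by `gamma` every `step_size` updates, in the style of PyTorch's `StepLR`. The helper name `step_decay_lr` is hypothetical.

```python
def step_decay_lr(step, base_lr=0.001, gamma=0.5, step_size=60000):
    """Learning rate after `step` updates under an assumed step-decay rule.

    Defaults mirror map_lr, map_scheduler_params.gamma and
    map_scheduler_params.step_size from vae_global.yaml.
    """
    return base_lr * gamma ** (step // step_size)


if __name__ == "__main__":
    # Print the rate at each decay boundary up to max_updates (200000).
    for step in (0, 60000, 120000, 180000, 199999):
        print(step, step_decay_lr(step))
```

Under this assumption the mapping learning rate would be halved three times before `max_updates: 200000` is reached.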
diff --git a/egs/egs_bases/config_base.yaml b/egs/egs_bases/config_base.yaml
@@ -0,0 +1,47 @@
+# task
+binary_data_dir: ''
+work_dir: '' # experiment directory.
+infer: false # infer
+amp: false
+seed: 1234
+debug: false
+save_codes: []
+#  - configs
+#  - modules
+#  - tasks
+#  - utils
+#  - usr
+
+#############
+# dataset
+#############
+ds_workers: 1
+test_num: 100
+endless_ds: false
+sort_by_len: true
+
+#########
+# train and eval
+#########
+print_nan_grads: false
+load_ckpt: ''
+save_best: true
+num_ckpt_keep: 3
+clip_grad_norm: 0
+accumulate_grad_batches: 1
+tb_log_interval: 100
+num_sanity_val_steps: 5  # steps of validation at the beginning
+check_val_every_n_epoch: 10
+val_check_interval: 2000
+valid_monitor_key: 'val_loss'
+valid_monitor_mode: 'min'
+max_epochs: 1000
+max_updates: 1000000
+max_tokens: 31250
+max_sentences: 100000
+max_valid_tokens: -1
+max_valid_sentences: -1
+eval_max_batches: -1
+test_input_dir: ''
+resume_from_checkpoint: 0
+rename_tmux: true
diff --git a/egs/egs_bases/tts/base.yaml b/egs/egs_bases/tts/base.yaml
@@ -0,0 +1,117 @@
+# task
+base_config: ../config_base.yaml
+task_cls: ''
+#############
+# dataset
+#############
+raw_data_dir: ''
+processed_data_dir: ''
+binary_data_dir: ''
+dict_dir: ''
+pre_align_cls: ''
+binarizer_cls: data_gen.tts.base_binarizer.BaseBinarizer
+mfa_version: 2
+pre_align_args:
+  nsample_per_mfa_group: 1000
+  txt_processor: en
+  use_tone: true # for ZH
+  sox_resample: false
+  sox_to_wav: false
+  allow_no_txt: false
+  trim_sil: false
+  denoise: false
+binarization_args:
+  shuffle: false
+  with_txt: true
+  with_wav: false
+  with_align: true
+  with_spk_embed: false
+  with_spk_id: true
+  with_f0: true
+  with_f0cwt: false
+  with_linear: false
+  with_word: true
+  trim_eos_bos: false
+  reset_phone_dict: true
+  reset_word_dict: true
+word_size: 30000
+pitch_extractor: parselmouth
+
+loud_norm: false
+endless_ds: true
+
+test_num: 100
+min_frames: 0
+max_frames: 1548
+frames_multiple: 1
+max_input_tokens: 1550
+audio_num_mel_bins: 80
+audio_sample_rate: 22050
+hop_size: 256  # 256 samples ~= 11.6 ms at 22050 Hz (12.5 ms would be 275 samples, 0.0125 * sample_rate)
+win_size: 1024  # 1024 samples ~= 46.4 ms at 22050 Hz (50 ms would be 1100 samples, 0.05 * sample_rate); if None, win_size = fft_size
+fmin: 80  # Set this to 55 for a male speaker; for a female speaker, 95 should help remove noise. (Tune per dataset. Pitch range: male ~[65, 260] Hz, female ~[100, 525] Hz)
+fmax: 7600  # To be increased/reduced depending on data.
+fft_size: 1024  # A window shorter than this is zero-padded to match this parameter
+min_level_db: -100
+ref_level_db: 20
+griffin_lim_iters: 60
+num_spk: 1
+mel_vmin: -6
+mel_vmax: 1.5
+ds_workers: 1
+
+#########
+# model
+#########
+dropout: 0.1
+enc_layers: 4
+dec_layers: 4
+hidden_size: 256
+num_heads: 2
+enc_ffn_kernel_size: 9
+dec_ffn_kernel_size: 9
+ffn_act: gelu
+ffn_padding: 'SAME'
+use_spk_id: false
+use_split_spk_id: false
+use_spk_embed: false
+mel_loss: l1
+
+
+###########
+# optimization
+###########
+lr: 2.0
+scheduler: rsqrt # rsqrt|none
+warmup_updates: 8000
+optimizer_adam_beta1: 0.9
+optimizer_adam_beta2: 0.98
+weight_decay: 0
+clip_grad_norm: 1
+clip_grad_value: 0
+
+
+###########
+# train and eval
+###########
+use_word_input: false
+max_tokens: 30000
+max_sentences: 100000
+max_valid_sentences: 1
+max_valid_tokens: 60000
+valid_infer_interval: 10000
+train_set_name: 'train'
+train_sets: ''
+valid_set_name: 'valid'
+test_set_name: 'test'
+num_test_samples: 0
+num_valid_plots: 10
+test_ids: [ ]
+vocoder: pwg
+vocoder_ckpt: ''
+vocoder_denoise_c: 0.0
+profile_infer: false
+out_wav_norm: false
+save_gt: true
+save_f0: false
+gen_dir_name: ''
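
The comments on `hop_size` and `win_size` in `base.yaml` quote sample counts for 12.5 ms and 50 ms, while the configured values are 256 and 1024. The short sketch below works out the durations that the configured values correspond to at `audio_sample_rate: 22050`. It assumes `max_frames` counts mel frames spaced `hop_size` samples apart; the helper name `samples_to_ms` is hypothetical.

```python
def samples_to_ms(n_samples, sample_rate):
    """Duration in milliseconds of n_samples at the given sample rate."""
    return 1000.0 * n_samples / sample_rate


SAMPLE_RATE = 22050  # audio_sample_rate
HOP_SIZE = 256       # hop_size
WIN_SIZE = 1024      # win_size
MAX_FRAMES = 1548    # max_frames

hop_ms = samples_to_ms(HOP_SIZE, SAMPLE_RATE)
win_ms = samples_to_ms(WIN_SIZE, SAMPLE_RATE)
# Assumption: one frame per hop, so the longest clip spans MAX_FRAMES hops.
max_clip_seconds = MAX_FRAMES * HOP_SIZE / SAMPLE_RATE

if __name__ == "__main__":
    print(f"hop: {hop_ms:.2f} ms, window: {win_ms:.2f} ms")
    print(f"max clip length: {max_clip_seconds:.2f} s")
```

With these values a hop is about 11.6 ms and a window about 46.4 ms, slightly shorter than the 12.5 ms and 50 ms mentioned in the comments.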
diff --git a/egs/egs_bases/tts/base_zh.yaml b/egs/egs_bases/tts/base_zh.yaml
@@ -0,0 +1,5 @@
+base_config: ./base.yaml
+pre_align_args:
+  txt_processor: zh
+binarizer_cls: data_gen.tts.binarizer_zh.ZhBinarizer
+word_size: 3000
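
`base_zh.yaml` lists `./base.yaml` under `base_config` and then overrides only a few keys, including one nested key inside `pre_align_args`. The config loader is not among the files shown here, so the sketch below is an assumption about how such inheritance could be resolved: a recursive merge in which the child file wins. The function name `merge_config` is hypothetical, and plain dicts stand in for the parsed YAML.

```python
import copy


def merge_config(base, override):
    """Return base updated by override, merging nested dicts recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested section: merge key by key instead of replacing it.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A few keys copied from base.yaml and base_zh.yaml.
base = {
    "pre_align_args": {"txt_processor": "en", "use_tone": True},
    "binarizer_cls": "data_gen.tts.base_binarizer.BaseBinarizer",
    "word_size": 30000,
}
base_zh = {
    "pre_align_args": {"txt_processor": "zh"},
    "binarizer_cls": "data_gen.tts.binarizer_zh.ZhBinarizer",
    "word_size": 3000,
}

config = merge_config(base, base_zh)

if __name__ == "__main__":
    print(config)
```

Under a recursive merge, `use_tone` from `base.yaml` survives while `txt_processor` becomes `zh`; a shallow merge would instead drop every `pre_align_args` key that `base_zh.yaml` does not repeat.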