Commit
0 parents
commit 683a779
Showing 41 changed files with 5,700 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
.idea | ||
*.pyc | ||
__pycache__/ | ||
*.sh | ||
local_tools/dtw |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
# Learning the Beauty in Songs: Neural Singing Voice Beautifier | ||
Jinglin Liu, Chengxi Li, Yi Ren, Zhiying Zhu, Zhou Zhao | ||
|
||
Zhejiang University | ||
|
||
ACL 2022 Main conference | ||
|
||
--- | ||
[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2202.13277) | ||
[![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/NeuralSVB)](https://github.com/MoonInTheRiver/NeuralSVB) | ||
![visitors](https://visitor-badge.glitch.me/badge?page_id=moonintheriver/NeuralSVB) | ||
|
||
<a href="https://neuralsvb.github.io" target="_blank">Project Page</a> | ||
|
||
<p align="center">:construction: :pick: :hammer_and_wrench: :construction_worker:</p> | ||
|
||
This repository is the official PyTorch implementation of our ACL-2022 [paper](https://arxiv.org/abs/2202.13277). For now, we release the code for the `SADTW` algorithm from the paper. The full code and data will be released around the ACL-2022 conference (before June 2022). Please star us and stay tuned! | ||
|
||
``` | ||
. | ||
|--modules | ||
    |--voice_conversion | ||
        |--dtw | ||
            |--enhance_sadtw.py (Our algorithm) | ||
|--tasks | ||
    |--singing | ||
        |--pitch_alignment_task.py (Usage example) | ||
``` | ||
|
||
|
||
:rocket: **News**: | ||
- Feb.24, 2022: Our new work, NeuralSVB, was accepted by ACL-2022. [Demo Page](https://neuralsvb.github.io). | ||
- Dec.01, 2021: Our recent work `DiffSinger` was accepted by AAAI-2022. [![](https://img.shields.io/github/stars/MoonInTheRiver/DiffSinger)](https://github.com/MoonInTheRiver/DiffSinger) [![downloads](https://img.shields.io/github/downloads/MoonInTheRiver/DiffSinger/total.svg)](https://github.com/MoonInTheRiver/DiffSinger/releases) \| [![](https://img.shields.io/github/stars/NATSpeech/NATSpeech)](https://github.com/NATSpeech/NATSpeech). | ||
- Sep.29, 2021: Our recent work `PortaSpeech` was accepted by NeurIPS-2021. [![](https://img.shields.io/github/stars/NATSpeech/NATSpeech)](https://github.com/NATSpeech/NATSpeech). | ||
- May.06, 2021: We submitted DiffSinger to arXiv [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446). | ||
|
||
|
||
## Abstract | ||
|
||
We are interested in a novel task, singing voice beautifying (SVB). Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice, while keeping the content and vocal timbre. Current automatic pitch correction techniques are immature, and most of them are restricted to intonation but ignore the overall aesthetic quality. Hence, we introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task, which adopts a conditional variational autoencoder as the backbone and learns the latent representations of vocal tone. In NSVB, we propose a novel time-warping approach for pitch correction: Shape-Aware Dynamic Time Warping (SADTW), which ameliorates the robustness of existing time-warping approaches, to synchronize the amateur recording with the template pitch curve. Furthermore, we propose a latent-mapping algorithm in the latent space to convert the amateur vocal tone to the professional one. Extensive experiments on both Chinese and English songs demonstrate the effectiveness of our methods in terms of both objective and subjective metrics. | ||
|
||
<img align="center" src="resources/model_all7.png" style=" display: block; | ||
margin-left: auto; | ||
margin-right: auto; | ||
width: 100%;" /> | ||
<img align="center" src="resources/melhhh2.png" style=" display: block; | ||
margin-left: auto; | ||
margin-right: auto; | ||
width: 100%;" /> |
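The abstract describes SADTW as a refinement of existing time-warping approaches for synchronizing an amateur recording with a template pitch curve. As background only, the following is a minimal sketch of textbook dynamic time warping on two 1-D pitch sequences. It is not the SADTW algorithm in `enhance_sadtw.py`; the function name `dtw_path`, the absolute-difference cost, and the toy inputs are assumptions made for illustration.

```python
import math


def dtw_path(template, amateur):
    """Classic DTW between two 1-D pitch curves.

    Hypothetical helper for illustration; NOT the repository's SADTW.
    Returns (total_cost, path), where path is a list of
    (template_index, amateur_index) pairs.
    """
    n, m = len(template), len(amateur)
    # acc[i][j]: minimal accumulated cost of aligning
    # template[:i] with amateur[:j].
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = abs(template[i - 1] - amateur[j - 1])  # point-wise cost
            acc[i][j] = dist + min(
                acc[i - 1][j - 1],  # advance both curves
                acc[i - 1][j],      # advance the template only
                acc[i][j - 1],      # advance the amateur curve only
            )

    # Backtrack from the last cell to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return acc[n][m], path


# Toy example: the amateur curve holds the middle note one frame longer.
cost, path = dtw_path([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
print(cost, path)
```

Plain DTW matches individual values, which is the kind of point-wise matching that a shape-aware variant would presumably try to improve on; see the paper for how SADTW actually defines its cost.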
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
base_config: | ||
- egs/egs_bases/tts/base_zh.yaml | ||
- egs/egs_bases/singing/base.yaml | ||
raw_data_dir: 'data/raw/popbutfy_short_male_0.75' | ||
processed_data_dir: 'data/processed/popbutfy_0.75' | ||
binary_data_dir: 'data/binary/popbutfy_0.75' | ||
|
||
# binarization parameters | ||
num_spk: 100 | ||
binarization_args: | ||
  with_spk_id: true | ||
  reset_phone_dict: true | ||
  reset_word_dict: true | ||
  with_spk_embed: false | ||
  with_wav: false | ||
  with_linear: false | ||
  with_f0cwt: false | ||
word_size: 1000 | ||
|
||
use_spk_embed: false | ||
use_spk_id: false | ||
use_ref_enc: false | ||
use_tech: true | ||
num_techs: 3 | ||
|
||
normalize_pitch: false | ||
|
||
# vocoder parameters | ||
vocoder: pwg | ||
vocoder_ckpt: '' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
base_config: | ||
- egs/egs_bases/vc/vc_ppg.yaml | ||
- base_text2mel.yaml | ||
|
||
binary_data_dir: 'data/binary/popbutfy_para_unseen_multispkemb' | ||
#binary_data_dir: 'data/binary/popcs_songs' | ||
|
||
task_cls: tasks.singing.svc_vae_task.SVCVAEGlobalTask | ||
use_energy: false | ||
|
||
# origin configs | ||
#lambda_mel_adv: 0.01 # | ||
max_tokens: 20000 | ||
max_frames: 5000 | ||
|
||
# vae parameters | ||
concurrent_ways: '' | ||
lambda_kl: 0.001 | ||
phase_1_steps: -1 | ||
phase_2_steps: 100000 | ||
max_updates: 200000 | ||
phase_1_concurrent_ways: 'p2p' | ||
phase_2_concurrent_ways: 'a2a,p2p' | ||
phase_3_concurrent_ways: 'a2p' | ||
cross_way_no_recon_loss: false | ||
cross_way_no_disc_loss: false | ||
disable_map: false | ||
|
||
latent_size: 128 | ||
fvae_enc_dec_hidden: 192 | ||
fvae_kernel_size: 5 | ||
fvae_enc_n_layers: 8 | ||
fvae_dec_n_layers: 4 | ||
|
||
frames_multiple: 4 | ||
|
||
# map parameters | ||
map_lr: 0.001 | ||
map_scheduler_params: | ||
  gamma: 0.5 | ||
  step_size: 60000 | ||
|
||
|
||
|
||
# vocoder parameters | ||
vocoder: hifigan | ||
vocoder_ckpt: 'checkpoints/1012_hifigan_all_songs_nsf' | ||
|
||
# asr parameters | ||
pretrain_asr_ckpt: 'checkpoints/1009_pretrain_asr' | ||
|
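The `map_lr`, `gamma` and `step_size` values above look like the parameters of a step-decay learning-rate schedule. The sketch below works out what such a schedule would yield over the 200000 updates set by `max_updates`, assuming the usual rule of multiplying the rate by `gamma` every `step_size` updates. Whether the task code feeds `map_scheduler_params` into a scheduler of exactly this form is an assumption, and `map_lr_at` is a hypothetical helper name.

```python
def map_lr_at(step, base_lr=0.001, gamma=0.5, step_size=60000):
    """Learning rate after `step` updates under step decay.

    Defaults mirror map_lr and map_scheduler_params in the config;
    the decay rule itself is an assumption about how they are used.
    """
    return base_lr * gamma ** (step // step_size)


# Print the rate at a few points up to max_updates (200000).
for step in (0, 59999, 60000, 120000, 180000, 199999):
    print(f"step {step:>6}: lr = {map_lr_at(step):.6f}")
```

Under these assumptions the rate is halved three times before training ends, at updates 60000, 120000 and 180000.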
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
# task | ||
binary_data_dir: '' | ||
work_dir: '' # experiment directory. | ||
infer: false # infer | ||
amp: false | ||
seed: 1234 | ||
debug: false | ||
save_codes: [] | ||
# - configs | ||
# - modules | ||
# - tasks | ||
# - utils | ||
# - usr | ||
|
||
############# | ||
# dataset | ||
############# | ||
ds_workers: 1 | ||
test_num: 100 | ||
endless_ds: false | ||
sort_by_len: true | ||
|
||
######### | ||
# train and eval | ||
######### | ||
print_nan_grads: false | ||
load_ckpt: '' | ||
save_best: true | ||
num_ckpt_keep: 3 | ||
clip_grad_norm: 0 | ||
accumulate_grad_batches: 1 | ||
tb_log_interval: 100 | ||
num_sanity_val_steps: 5 # steps of validation at the beginning | ||
check_val_every_n_epoch: 10 | ||
val_check_interval: 2000 | ||
valid_monitor_key: 'val_loss' | ||
valid_monitor_mode: 'min' | ||
max_epochs: 1000 | ||
max_updates: 1000000 | ||
max_tokens: 31250 | ||
max_sentences: 100000 | ||
max_valid_tokens: -1 | ||
max_valid_sentences: -1 | ||
eval_max_batches: -1 | ||
test_input_dir: '' | ||
resume_from_checkpoint: 0 | ||
rename_tmux: true |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,117 @@ | ||
# task | ||
base_config: ../config_base.yaml | ||
task_cls: '' | ||
############# | ||
# dataset | ||
############# | ||
raw_data_dir: '' | ||
processed_data_dir: '' | ||
binary_data_dir: '' | ||
dict_dir: '' | ||
pre_align_cls: '' | ||
binarizer_cls: data_gen.tts.base_binarizer.BaseBinarizer | ||
mfa_version: 2 | ||
pre_align_args: | ||
  nsample_per_mfa_group: 1000 | ||
  txt_processor: en | ||
  use_tone: true # for ZH | ||
  sox_resample: false | ||
  sox_to_wav: false | ||
  allow_no_txt: false | ||
  trim_sil: false | ||
  denoise: false | ||
binarization_args: | ||
  shuffle: false | ||
  with_txt: true | ||
  with_wav: false | ||
  with_align: true | ||
  with_spk_embed: false | ||
  with_spk_id: true | ||
  with_f0: true | ||
  with_f0cwt: false | ||
  with_linear: false | ||
  with_word: true | ||
  trim_eos_bos: false | ||
  reset_phone_dict: true | ||
  reset_word_dict: true | ||
word_size: 30000 | ||
pitch_extractor: parselmouth | ||
|
||
loud_norm: false | ||
endless_ds: true | ||
|
||
test_num: 100 | ||
min_frames: 0 | ||
max_frames: 1548 | ||
frames_multiple: 1 | ||
max_input_tokens: 1550 | ||
audio_num_mel_bins: 80 | ||
audio_sample_rate: 22050 | ||
hop_size: 256 # 256 samples ~= 11.6 ms at 22050 Hz (a 12.5 ms hop, 0.0125 * sample_rate, would be ~275 samples) | ||
win_size: 1024 # 1024 samples ~= 46.4 ms at 22050 Hz (a 50 ms window, 0.05 * sample_rate, would be ~1100 samples; if None, win_size = fft_size) | ||
fmin: 80 # Set this to 55 if your speaker is male; for a female speaker, 95 should help remove noise. (Test on your dataset. Pitch info: male~[65, 260], female~[100, 525]) | ||
fmax: 7600 # To be increased/reduced depending on data. | ||
fft_size: 1024 # Extra window size is filled with 0 paddings to match this parameter | ||
min_level_db: -100 | ||
ref_level_db: 20 | ||
griffin_lim_iters: 60 | ||
num_spk: 1 | ||
mel_vmin: -6 | ||
mel_vmax: 1.5 | ||
ds_workers: 1 | ||
|
||
######### | ||
# model | ||
######### | ||
dropout: 0.1 | ||
enc_layers: 4 | ||
dec_layers: 4 | ||
hidden_size: 256 | ||
num_heads: 2 | ||
enc_ffn_kernel_size: 9 | ||
dec_ffn_kernel_size: 9 | ||
ffn_act: gelu | ||
ffn_padding: 'SAME' | ||
use_spk_id: false | ||
use_split_spk_id: false | ||
use_spk_embed: false | ||
mel_loss: l1 | ||
|
||
|
||
########### | ||
# optimization | ||
########### | ||
lr: 2.0 | ||
scheduler: rsqrt # rsqrt|none | ||
warmup_updates: 8000 | ||
optimizer_adam_beta1: 0.9 | ||
optimizer_adam_beta2: 0.98 | ||
weight_decay: 0 | ||
clip_grad_norm: 1 | ||
clip_grad_value: 0 | ||
|
||
|
||
########### | ||
# train and eval | ||
########### | ||
use_word_input: false | ||
max_tokens: 30000 | ||
max_sentences: 100000 | ||
max_valid_sentences: 1 | ||
max_valid_tokens: 60000 | ||
valid_infer_interval: 10000 | ||
train_set_name: 'train' | ||
train_sets: '' | ||
valid_set_name: 'valid' | ||
test_set_name: 'test' | ||
num_test_samples: 0 | ||
num_valid_plots: 10 | ||
test_ids: [ ] | ||
vocoder: pwg | ||
vocoder_ckpt: '' | ||
vocoder_denoise_c: 0.0 | ||
profile_infer: false | ||
out_wav_norm: false | ||
save_gt: true | ||
save_f0: false | ||
gen_dir_name: '' |
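The framing parameters in this file (`audio_sample_rate`, `hop_size`, `win_size`, `max_frames`) are easier to reason about as durations. The short check below converts them using only the values shown in the config; the helper name `samples_to_ms` is introduced here for illustration.

```python
SAMPLE_RATE = 22050  # audio_sample_rate
HOP_SIZE = 256       # hop_size, in samples
WIN_SIZE = 1024      # win_size, in samples
MAX_FRAMES = 1548    # max_frames


def samples_to_ms(num_samples, sample_rate=SAMPLE_RATE):
    """Convert a sample count to milliseconds at the given sample rate."""
    return 1000.0 * num_samples / sample_rate


hop_ms = samples_to_ms(HOP_SIZE)
win_ms = samples_to_ms(WIN_SIZE)
# Longest clip allowed by max_frames, measured in hop-sized steps.
max_seconds = MAX_FRAMES * HOP_SIZE / SAMPLE_RATE

print(f"hop:    {hop_ms:.2f} ms")
print(f"window: {win_ms:.2f} ms")
print(f"max clip length: {max_seconds:.2f} s")
```

With these settings a frame advances by roughly 11.6 ms, each analysis window spans roughly 46.4 ms, and `max_frames` caps a clip at just under 18 seconds.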
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
base_config: ./base.yaml | ||
pre_align_args: | ||
  txt_processor: zh | ||
binarizer_cls: data_gen.tts.binarizer_zh.ZhBinarizer | ||
word_size: 3000 |
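The configs in this commit chain together through `base_config`, which is either a single path or a list of paths. The sketch below shows one plausible way such a chain could be resolved: load the bases first, then let the child override them, merging nested mappings key by key. It is an illustration under stated assumptions, not the repository's loader. The names `override_config` and `load_config` are hypothetical, the files are represented as in-memory dicts keyed by a simplified file name (path resolution is omitted), and the nesting of `use_tone` under `pre_align_args` is inferred because the diff view does not show indentation.

```python
import copy


def override_config(old, new):
    """Merge `new` into `old` in place; nested dicts are merged key by key."""
    for key, value in new.items():
        if isinstance(value, dict) and isinstance(old.get(key), dict):
            override_config(old[key], value)
        else:
            old[key] = copy.deepcopy(value)


def load_config(name, files):
    """Resolve `base_config` recursively, with later files overriding earlier ones."""
    raw = files[name]
    bases = raw.get("base_config", [])
    if isinstance(bases, str):  # a single path is allowed as well as a list
        bases = [bases]
    merged = {}
    for base in bases:
        override_config(merged, load_config(base, files))
    own = {k: v for k, v in raw.items() if k != "base_config"}
    override_config(merged, own)
    return merged


# Trimmed-down stand-ins for base.yaml and base_zh.yaml.
FILES = {
    "base.yaml": {
        "pre_align_args": {"txt_processor": "en", "use_tone": True},
        "word_size": 30000,
    },
    "base_zh.yaml": {
        "base_config": "base.yaml",
        "pre_align_args": {"txt_processor": "zh"},
        "binarizer_cls": "data_gen.tts.binarizer_zh.ZhBinarizer",
        "word_size": 3000,
    },
}

cfg = load_config("base_zh.yaml", FILES)
print(cfg)
```

Under this merge rule the Chinese config only has to state what differs: `txt_processor` and `word_size` change, while `use_tone` is inherited from the base file.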