first commit
MoonInTheRiver committed Mar 1, 2022
0 parents commit 683a779
Showing 41 changed files with 5,700 additions and 0 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
.idea
*.pyc
__pycache__/
*.sh
local_tools/dtw
49 changes: 49 additions & 0 deletions README.md
@@ -0,0 +1,49 @@
# Learning the Beauty in Songs: Neural Singing Voice Beautifier
Jinglin Liu, Chengxi Li, Yi Ren, Zhiying Zhu, Zhou Zhao

Zhejiang University

ACL 2022 Main conference

---
[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2202.13277)
[![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/NeuralSVB)](https://github.com/MoonInTheRiver/NeuralSVB)
![visitors](https://visitor-badge.glitch.me/badge?page_id=moonintheriver/NeuralSVB)

<a href="https://neuralsvb.github.io" target="_blank">Project&nbsp;Page</a>

<p align="center">:construction: :pick: :hammer_and_wrench: :construction_worker:</p>

This repository is the official PyTorch implementation of our ACL-2022 [paper](https://arxiv.org/abs/2202.13277). We have released the code for the `SADTW` algorithm described in the paper. The full code and data will be released at the ACL-2022 conference (before June 2022). Please star us and stay tuned!

```
.
|--modules
    |--voice_conversion
        |--dtw
            |--enhance_sadtw.py (Our algorithm)
|--tasks
    |--singing
        |--pitch_alignment_task.py (Usage example)
```


:rocket: **News**:
- Feb.24, 2022: Our new work, NeuralSVB, was accepted by ACL-2022. [Demo Page](https://neuralsvb.github.io).
- Dec.01, 2021: Our recent work `DiffSinger` was accepted by AAAI-2022. [![](https://img.shields.io/github/stars/MoonInTheRiver/DiffSinger)](https://github.com/MoonInTheRiver/DiffSinger) [![downloads](https://img.shields.io/github/downloads/MoonInTheRiver/DiffSinger/total.svg)](https://github.com/MoonInTheRiver/DiffSinger/releases) \| [![](https://img.shields.io/github/stars/NATSpeech/NATSpeech)](https://github.com/NATSpeech/NATSpeech).
- Sep.29, 2021: Our recent work `PortaSpeech` was accepted by NeurIPS-2021. [![](https://img.shields.io/github/stars/NATSpeech/NATSpeech)](https://github.com/NATSpeech/NATSpeech).
- May.06, 2021: We submitted DiffSinger to arXiv [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446).


## Abstract

We are interested in a novel task, singing voice beautifying (SVB). Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice, while keeping the content and vocal timbre. Current automatic pitch correction techniques are immature, and most of them are restricted to intonation but ignore the overall aesthetic quality. Hence, we introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task, which adopts a conditional variational autoencoder as the backbone and learns the latent representations of vocal tone. In NSVB, we propose a novel time-warping approach for pitch correction: Shape-Aware Dynamic Time Warping (SADTW), which improves the robustness of existing time-warping approaches, to synchronize the amateur recording with the template pitch curve. Furthermore, we propose a latent-mapping algorithm in the latent space to convert the amateur vocal tone to the professional one. Extensive experiments on both Chinese and English songs demonstrate the effectiveness of our methods in terms of both objective and subjective metrics.
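
For orientation only, the sketch below shows classical dynamic time warping aligning a toy amateur pitch curve to a template. It is not the shape-aware variant released in `enhance_sadtw.py`; the function name `dtw_path`, the absolute-difference cost, and the toy f0 values are illustrative assumptions rather than anything taken from the repository.

```python
from typing import List, Tuple


def dtw_path(a: List[float], b: List[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Classical DTW (absolute-difference cost). Returns (total cost, warping path)."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    n, m = len(a), len(b)
    inf = float("inf")
    # acc[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])

    # Backtrack from the last cell to the first to recover the path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (acc[i - 1][j], i - 1, j),          # advance in a only
            (acc[i][j - 1], i, j - 1),          # advance in b only
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n][m], path


# Toy example: the amateur curve reaches 110 Hz one frame later than the template.
amateur_f0 = [100.0, 100.0, 110.0, 120.0]
template_f0 = [100.0, 110.0, 110.0, 120.0]
total_cost, path = dtw_path(amateur_f0, template_f0)
```

Each `(i, j)` pair in `path` says that frame `i` of the amateur curve is matched to frame `j` of the template. Plain DTW compares single frame values only, which is the kind of limitation a shape-aware variant is meant to address.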

<img align="center" src="resources/model_all7.png" style=" display: block;
margin-left: auto;
margin-right: auto;
width: 100%;" />
<img align="center" src="resources/melhhh2.png" style=" display: block;
margin-left: auto;
margin-right: auto;
width: 100%;" />
30 changes: 30 additions & 0 deletions egs/datasets/audio/PopBuTFy/base_text2mel.yaml
@@ -0,0 +1,30 @@
base_config:
- egs/egs_bases/tts/base_zh.yaml
- egs/egs_bases/singing/base.yaml
raw_data_dir: 'data/raw/popbutfy_short_male_0.75'
processed_data_dir: 'data/processed/popbutfy_0.75'
binary_data_dir: 'data/binary/popbutfy_0.75'

# binarization parameters
num_spk: 100
binarization_args:
  with_spk_id: true
  reset_phone_dict: true
  reset_word_dict: true
  with_spk_embed: false
  with_wav: false
  with_linear: false
  with_f0cwt: false
word_size: 1000

use_spk_embed: false
use_spk_id: false
use_ref_enc: false
use_tech: true
num_techs: 3

normalize_pitch: false

# vocoder parameters
vocoder: pwg
vocoder_ckpt: ''
51 changes: 51 additions & 0 deletions egs/datasets/audio/PopBuTFy/vae_global.yaml
@@ -0,0 +1,51 @@
base_config:
- egs/egs_bases/vc/vc_ppg.yaml
- base_text2mel.yaml

binary_data_dir: 'data/binary/popbutfy_para_unseen_multispkemb'
#binary_data_dir: 'data/binary/popcs_songs'

task_cls: tasks.singing.svc_vae_task.SVCVAEGlobalTask
use_energy: false

# origin configs
#lambda_mel_adv: 0.01 #
max_tokens: 20000
max_frames: 5000

# vae parameters
concurrent_ways: ''
lambda_kl: 0.001
phase_1_steps: -1
phase_2_steps: 100000
max_updates: 200000
phase_1_concurrent_ways: 'p2p'
phase_2_concurrent_ways: 'a2a,p2p'
phase_3_concurrent_ways: 'a2p'
cross_way_no_recon_loss: false
cross_way_no_disc_loss: false
disable_map: false

latent_size: 128
fvae_enc_dec_hidden: 192
fvae_kernel_size: 5
fvae_enc_n_layers: 8
fvae_dec_n_layers: 4

frames_multiple: 4

# map parameters
map_lr: 0.001
map_scheduler_params:
  gamma: 0.5
  step_size: 60000



# vocoder parameters
vocoder: hifigan
vocoder_ckpt: 'checkpoints/1012_hifigan_all_songs_nsf'

# asr parameters
pretrain_asr_ckpt: 'checkpoints/1009_pretrain_asr'
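
The `map_lr`, `gamma` and `step_size` values above look like the settings of a step-decay learning-rate schedule (in the style of PyTorch's `StepLR`). Assuming that reading is right, the sketch below shows how the mapping learning rate would evolve. The helper `step_decay_lr` is hypothetical and not part of this commit.

```python
def step_decay_lr(base_lr: float, gamma: float, step_size: int, step: int) -> float:
    """Step decay: multiply the rate by `gamma` once every `step_size` updates."""
    return base_lr * gamma ** (step // step_size)


# Values copied from vae_global.yaml; the schedule type itself is an assumption.
MAP_LR = 0.001
GAMMA = 0.5
STEP_SIZE = 60000

for step in (0, 59999, 60000, 120000, 180000):
    print(step, step_decay_lr(MAP_LR, GAMMA, STEP_SIZE, step))
```

Under this assumption the rate halves at 60k, 120k and 180k updates, so it has been halved three times by the time training approaches `max_updates: 200000`.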

47 changes: 47 additions & 0 deletions egs/egs_bases/config_base.yaml
@@ -0,0 +1,47 @@
# task
binary_data_dir: ''
work_dir: '' # experiment directory.
infer: false # infer
amp: false
seed: 1234
debug: false
save_codes: []
# - configs
# - modules
# - tasks
# - utils
# - usr

#############
# dataset
#############
ds_workers: 1
test_num: 100
endless_ds: false
sort_by_len: true

#########
# train and eval
#########
print_nan_grads: false
load_ckpt: ''
save_best: true
num_ckpt_keep: 3
clip_grad_norm: 0
accumulate_grad_batches: 1
tb_log_interval: 100
num_sanity_val_steps: 5 # steps of validation at the beginning
check_val_every_n_epoch: 10
val_check_interval: 2000
valid_monitor_key: 'val_loss'
valid_monitor_mode: 'min'
max_epochs: 1000
max_updates: 1000000
max_tokens: 31250
max_sentences: 100000
max_valid_tokens: -1
max_valid_sentences: -1
eval_max_batches: -1
test_input_dir: ''
resume_from_checkpoint: 0
rename_tmux: true
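
`max_tokens`, `max_sentences` and `sort_by_len` suggest length-aware dynamic batching. The sketch below is one common way to do it, assuming a batch costs `number of samples x longest sample` once padded. The function `batch_by_size` and the toy lengths are illustrative assumptions; the batching code actually used by the repository is not part of this excerpt.

```python
from typing import List


def batch_by_size(lengths: List[int], max_tokens: int, max_sentences: int) -> List[List[int]]:
    """Group sample indices so each padded batch stays within both limits."""
    # Sorting by length keeps padding waste low (cf. sort_by_len: true).
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    longest = 0
    for idx in order:
        n = lengths[idx]
        if n > max_tokens:
            raise ValueError(f"sample {idx} has {n} tokens, more than max_tokens={max_tokens}")
        new_longest = max(longest, n)
        too_many_tokens = (len(current) + 1) * new_longest > max_tokens
        too_many_sentences = len(current) + 1 > max_sentences
        if current and (too_many_tokens or too_many_sentences):
            # Close the current batch and start a new one with this sample.
            batches.append(current)
            current, new_longest = [], n
        current.append(idx)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


toy_batches = batch_by_size([5, 3, 8, 2, 7], max_tokens=16, max_sentences=3)
```

With the defaults in this file (`max_tokens: 31250`, `max_sentences: 100000`) the token limit would almost always be the binding one under this scheme.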
117 changes: 117 additions & 0 deletions egs/egs_bases/tts/base.yaml
@@ -0,0 +1,117 @@
# task
base_config: ../config_base.yaml
task_cls: ''
#############
# dataset
#############
raw_data_dir: ''
processed_data_dir: ''
binary_data_dir: ''
dict_dir: ''
pre_align_cls: ''
binarizer_cls: data_gen.tts.base_binarizer.BaseBinarizer
mfa_version: 2
pre_align_args:
  nsample_per_mfa_group: 1000
  txt_processor: en
  use_tone: true # for ZH
  sox_resample: false
  sox_to_wav: false
  allow_no_txt: false
  trim_sil: false
  denoise: false
binarization_args:
  shuffle: false
  with_txt: true
  with_wav: false
  with_align: true
  with_spk_embed: false
  with_spk_id: true
  with_f0: true
  with_f0cwt: false
  with_linear: false
  with_word: true
  trim_eos_bos: false
  reset_phone_dict: true
  reset_word_dict: true
word_size: 30000
pitch_extractor: parselmouth

loud_norm: false
endless_ds: true

test_num: 100
min_frames: 0
max_frames: 1548
frames_multiple: 1
max_input_tokens: 1550
audio_num_mel_bins: 80
audio_sample_rate: 22050
hop_size: 256 # 256 samples ~= 11.6 ms at 22050 Hz (12.5 ms would be ~275 samples, i.e. 0.0125 * sample_rate)
win_size: 1024 # 1024 samples ~= 46.4 ms at 22050 Hz (50 ms would be ~1100 samples, i.e. 0.05 * sample_rate). If None, win_size = fft_size
fmin: 80 # Try 55 for a male speaker; for a female speaker, 95 should help remove noise. Tune per dataset. Pitch ranges: male ~[65, 260] Hz, female ~[100, 525] Hz
fmax: 7600 # Increase or reduce depending on the data.
fft_size: 1024 # A window shorter than fft_size is zero-padded to this length
min_level_db: -100
ref_level_db: 20
griffin_lim_iters: 60
num_spk: 1
mel_vmin: -6
mel_vmax: 1.5
ds_workers: 1

#########
# model
#########
dropout: 0.1
enc_layers: 4
dec_layers: 4
hidden_size: 256
num_heads: 2
enc_ffn_kernel_size: 9
dec_ffn_kernel_size: 9
ffn_act: gelu
ffn_padding: 'SAME'
use_spk_id: false
use_split_spk_id: false
use_spk_embed: false
mel_loss: l1


###########
# optimization
###########
lr: 2.0
scheduler: rsqrt # rsqrt|none
warmup_updates: 8000
optimizer_adam_beta1: 0.9
optimizer_adam_beta2: 0.98
weight_decay: 0
clip_grad_norm: 1
clip_grad_value: 0


###########
# train and eval
###########
use_word_input: false
max_tokens: 30000
max_sentences: 100000
max_valid_sentences: 1
max_valid_tokens: 60000
valid_infer_interval: 10000
train_set_name: 'train'
train_sets: ''
valid_set_name: 'valid'
test_set_name: 'test'
num_test_samples: 0
num_valid_plots: 10
test_ids: [ ]
vocoder: pwg
vocoder_ckpt: ''
vocoder_denoise_c: 0.0
profile_infer: false
out_wav_norm: false
save_gt: true
save_f0: false
gen_dir_name: ''
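
The nominal `lr: 2.0` only makes sense together with `scheduler: rsqrt`, `warmup_updates: 8000` and `hidden_size: 256`. A minimal sketch, assuming the usual inverse-square-root (Noam-style) schedule with linear warmup and a `hidden_size ** -0.5` scale. The exact formula used by the repository's scheduler is not shown in this excerpt and may differ, and the function name `rsqrt_lr` is hypothetical.

```python
def rsqrt_lr(step: int, base_lr: float = 2.0, warmup_updates: int = 8000,
             hidden_size: int = 256) -> float:
    """Inverse-square-root schedule with linear warmup (assumed form)."""
    warmup = min(step / warmup_updates, 1.0)         # ramps from 0 to 1 during warmup
    rsqrt_decay = max(warmup_updates, step) ** -0.5  # flat during warmup, then ~ step^-0.5
    rsqrt_hidden = hidden_size ** -0.5               # 1/16 for hidden_size = 256
    return base_lr * warmup * rsqrt_decay * rsqrt_hidden


for step in (0, 4000, 8000, 32000):
    print(step, rsqrt_lr(step))
```

Under these assumptions the effective rate peaks at roughly 1.4e-3 at the end of warmup and falls to half of that by step 32000, which is far smaller than the nominal 2.0.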
5 changes: 5 additions & 0 deletions egs/egs_bases/tts/base_zh.yaml
@@ -0,0 +1,5 @@
base_config: ./base.yaml
pre_align_args:
  txt_processor: zh
binarizer_cls: data_gen.tts.binarizer_zh.ZhBinarizer
word_size: 3000
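
`base_zh.yaml` overrides only a few keys and inherits everything else through `base_config`. The sketch below shows how such a chain could be resolved, assuming base files are loaded first and nested mappings are merged key by key. In-memory dicts stand in for the YAML files, paths are simplified to bare file names, and `CONFIGS`, `deep_merge` and `load_config` are hypothetical names; the repository's actual config loader is not part of this excerpt.

```python
from typing import Any, Dict

# Trimmed stand-ins for the YAML files above (only a few keys each).
CONFIGS: Dict[str, Dict[str, Any]] = {
    "config_base.yaml": {"seed": 1234, "max_tokens": 31250},
    "base.yaml": {
        "base_config": "config_base.yaml",
        "pre_align_args": {"txt_processor": "en", "use_tone": True},
        "word_size": 30000,
        "max_tokens": 30000,
    },
    "base_zh.yaml": {
        "base_config": "base.yaml",
        "pre_align_args": {"txt_processor": "zh"},
        "word_size": 3000,
    },
}


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` updated by `override`, merging nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(name: str) -> Dict[str, Any]:
    """Resolve `base_config` recursively; later files override earlier ones."""
    raw = dict(CONFIGS[name])
    bases = raw.pop("base_config", [])
    if isinstance(bases, str):
        bases = [bases]
    resolved: Dict[str, Any] = {}
    for base in bases:
        resolved = deep_merge(resolved, load_config(base))
    return deep_merge(resolved, raw)


cfg = load_config("base_zh.yaml")
```

With a key-by-key merge, `cfg["pre_align_args"]` keeps `use_tone` from `base.yaml` while `txt_processor` becomes `zh`. A shallow merge would instead drop every inherited key under `pre_align_args`.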