(Reproduction of the paper SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training.)
Set up the environment with the following command after preparing the SLAM-LLM environment:
pip install -r ./examples/s2s/requirements.txt
Alternatively, you can use our provided Docker image:
docker pull worstchan/slam-omni:v0
docker run -it --gpus all --name slam-omni worstchan/slam-omni:v0 /bin/bash
Our project supports two data formats: Parquet and JSONL. The open-source datasets are available on the Hugging Face Hub in Parquet format. Example usage is provided in this notebook.
We provide three re-synthesized datasets for SLAM-Omni training:
- VoiceAssistant-400K: Single-round English dialogue dataset.
- UltraChat-300K: Multi-round English dialogue dataset.
- Belle_1.4M: Multi-round Chinese dialogue dataset.
You can load any of these datasets using the following code:
from datasets import load_dataset
# Replace "DATASET_NAME" with one of the following:
# - "worstchan/VoiceAssistant-400K-SLAM-Omni"
# - "worstchan/UltraChat-300K-SLAM-Omni"
# - "worstchan/Belle_1.4M-SLAM-Omni"
ds = load_dataset("DATASET_NAME")
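To verify the download, you can inspect the splits and fields directly. This is a minimal sketch; the split name "train" is an assumption and the exact columns depend on the dataset:
print(ds)                # shows the available splits and their columns
sample = ds["train"][0]  # first example of the (assumed) "train" split
print(sample.keys())     # field names of a single dialogue example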
We also support the JSONL format for its concise structure. Below is an example:
{"key": "1", "source_wav": "/xxx/1.wav", "source_text": "Can you recommend some Chinese food for me?", "target_wav": "/xxx/1.wav", "target_text": "Sure! I recommend trying dumplings, Peking duck, and mapo tofu for a mix of flavors and textures in Chinese cuisine. These dishes offer a good balance of savory, spicy, and crispy elements."}
We reproduced the single-stage fine-tuning results of SLAM-Omni with a group size of 3. The following checkpoints are available for download:
- Single-Round Dialogue (English): Trained on VoiceAssistant-400K.
- Multi-Round Dialogue (English): Trained on VoiceAssistant-400K and UltraChat-300K.
- Multi-Round Dialogue (Chinese): Trained on Belle_1.4M.
You can pre-train the S2S model using TTS or ASR tasks with our provided scripts, though we recommend proceeding directly to fine-tuning. Alternatively, you may directly train a TTS or ASR model under the SLAM-Omni framework. For detailed instructions, refer to the pre-training README.
We provide two primary fine-tuning options for SLAM-Omni modeling:
# Fine-tune with grouping strategy (Recommended)
bash ./examples/s2s/scripts/finetune/finetune_s2s_group.sh
# Fine-tune without grouping
bash ./examples/s2s/scripts/finetune/finetune_s2s.sh
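The grouping strategy packs several consecutive audio tokens into a single decoding step (our released checkpoints use a group size of 3), which shortens the target sequence. The sketch below only illustrates the idea; the padding id is a made-up placeholder and the actual logic lives in the training scripts:
# Group consecutive audio tokens so that each step predicts `group_size` tokens.
def group_tokens(tokens, group_size=3, pad_id=0):
    remainder = len(tokens) % group_size
    if remainder:  # pad so the length is a multiple of group_size
        tokens = tokens + [pad_id] * (group_size - remainder)
    return [tokens[i:i + group_size] for i in range(0, len(tokens), group_size)]

print(group_tokens([11, 12, 13, 14, 15, 16, 17]))
# [[11, 12, 13], [14, 15, 16], [17, 0, 0]]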
We also include scripts for reproducing Mini-Omni. Note that this requires the original VoiceAssistant-400K dataset for training:
bash ./examples/s2s/scripts/finetune/mini-omni/finetune_s2s.sh
In principle, our framework supports training any codec-based spoken dialogue model: simply re-synthesize the target tokens used during training (e.g., CosyVoice2 tokens) to ensure compatibility.
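As an illustration, adapting the data to another codec amounts to swapping the target-token extraction step. In the sketch below, encode_to_codec_tokens is a hypothetical stand-in for the codec's real tokenizer (e.g., CosyVoice2's), and the target_token field name is purely illustrative:
import json

# Hypothetical stand-in for the codec tokenizer (e.g., CosyVoice2); replace its
# body with the real encoder call for your codec.
def encode_to_codec_tokens(wav_path):
    return [0, 1, 2]  # dummy token ids for illustration only

with open("train.jsonl") as fin, open("train_with_tokens.jsonl", "w") as fout:
    for line in fin:
        item = json.loads(line)
        item["target_token"] = encode_to_codec_tokens(item["target_wav"])  # illustrative field name
        fout.write(json.dumps(item, ensure_ascii=False) + "\n")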
We provide scripts for both online and batch inference. You can use your trained model or the provided checkpoints for inference. For detailed guidance, refer to the inference README.
Run the following commands for real-time inference:
# Multi-turn (Recommended)
bash ./examples/s2s/scripts/inference/inference_s2s_online_multi-round.sh
# Single-turn
bash ./examples/s2s/scripts/inference/inference_s2s_online.sh
For Mini-Omni modeling, use the following commands:
# Single-turn non-streaming
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_online.sh
# Single-turn streaming
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_online_stream.sh
For batch inference, ensure the data format matches the training format (Parquet or JSONL). Use the following commands:
# SLAM-Omni framework
bash ./examples/s2s/scripts/inference/inference_s2s_batch.sh
# Mini-Omni framework
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_batch.sh
- Add evaluation scripts.
- Add streaming inference scripts for SLAM-Omni.
SLAM-Omni:
@article{chen2024slam,
title={SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training},
author={Chen, Wenxi and Ma, Ziyang and Yan, Ruiqi and Liang, Yuzhe and Li, Xiquan and Xu, Ruiyang and Niu, Zhikang and Zhu, Yanqiao and Yang, Yifan and Liu, Zhanxun and others},
journal={arXiv preprint arXiv:2412.15649},
year={2024}
}
Mini-Omni:
@article{xie2024mini,
title={Mini-omni: Language models can hear, talk while thinking in streaming},
author={Xie, Zhifei and Wu, Changqiao},
journal={arXiv preprint arXiv:2408.16725},
year={2024}
}
- We borrow some code from Mini-Omni for SNAC-based modeling.
- We borrow some code from CosyVoice for the vocoder.
Our code is released under the MIT License. The Chinese dialogue model is licensed under GPL-3.0 due to its use of Belle data and is intended for research purposes only.