
Hoping for training code #3

Open
karrynest opened this issue Dec 20, 2024 · 3 comments

Comments

@karrynest

This is amazing work. Could you release your training code?

@kehanlu
Owner

kehanlu commented Jan 2, 2025

Hello @karrynest,
Apologies for the delayed response. I've prepared the code and some simple documentation for training and evaluating our model, which you can find in this branch.

Unfortunately, I am unable to share the raw audio files due to licensing restrictions. However, you can access the speech captions here. 😄

@karrynest
Author

Does the model support a "voice chat mode", i.e., the user input is just audio without any text, and the instructions are contained in the audio itself, as in Qwen2-Audio?

```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", device_map="auto"
)

# Audio-only conversation: each user turn is just an audio clip, and the
# instruction to the model is spoken inside the clip.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Download and decode every audio clip referenced in the conversation,
# resampled to the feature extractor's expected sampling rate.
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele["audio_url"]).read()),
                    sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

# Generate, then strip the prompt tokens so only the new response remains.
generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```

@kehanlu
Owner

kehanlu commented Jan 15, 2025

Hello @karrynest,
While the model was not specifically trained to respond to audio-only input, we believe it can support a voice chat mode similar to Qwen2-Audio.

In our initial attempt, we discovered that using system prompts can effectively guide the model's behavior. We recommend experimenting with different system prompts for your specific scenario!
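As a concrete starting point, here is a minimal sketch of how a system prompt could be prepended to the audio-only conversation from the snippet above. The prompt wording is an assumption for illustration, not something from the repository; the resulting `conversation` list would then be passed to `processor.apply_chat_template(...)` exactly as before.

```python
# Sketch: prepend a "system" turn to steer the model toward voice-chat
# behavior. The prompt text below is an assumption -- experiment with the
# wording for your own scenario, as suggested above.
system_prompt = (
    "You are a helpful voice assistant. The user's instruction is spoken "
    "inside the audio clip; follow the spoken instruction directly."
)

conversation = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": [
        # Audio-only user turn; the URL is reused from the example above.
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
]

# `conversation` is then fed to processor.apply_chat_template(...) and the
# rest of the pipeline (audio loading, processor, generate) is unchanged.
```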

If you achieve positive results, please let us know! 😊
