
Hoping for training code #3

Open
karrynest opened this issue Dec 20, 2024 · 3 comments

Comments

@karrynest

This is amazing work. Could you release your training code?

@kehanlu
Owner

kehanlu commented Jan 2, 2025

Hello @karrynest,
Apologies for the delayed response. I've prepared the code and some simple documentation for training and evaluating our model, which you can find in this branch.

Unfortunately, I am unable to share the raw audio files due to licensing restrictions. However, you can access the speech captions here. 😄

@karrynest
Author

Does the model support a "voice chat mode", i.e., the user input is just audio without any text, and the instructions are contained in the audio itself, as in Qwen2-Audio?

```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", device_map="auto"
)

# Audio-only conversation: each user turn is just an audio clip, and the
# instruction to the model is spoken inside the clip.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Download and decode every audio clip referenced in the conversation,
# resampled to the feature extractor's expected sampling rate.
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele["audio_url"]).read()),
                    sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

# Generate, then strip the prompt tokens so only the new response remains.
generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```

@kehanlu
Owner

kehanlu commented Jan 15, 2025

Hello @karrynest,
While the model was not specifically trained to respond to audio-only input, we believe it can support a voice chat mode similar to Qwen2-Audio.

In our initial attempt, we discovered that using system prompts can effectively guide the model's behavior. We recommend experimenting with different system prompts for your specific scenario!
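As a concrete starting point, here is a minimal sketch of how a system prompt could be prepended to the audio-only conversation from the snippet above. The prompt wording is an assumption for illustration, not something from the repository; the resulting `conversation` list would then be passed to `processor.apply_chat_template(...)` exactly as before.

```python
# Sketch: prepend a "system" turn to steer the model toward voice-chat
# behavior. The prompt text below is an assumption -- experiment with the
# wording for your own scenario, as suggested above.
system_prompt = (
    "You are a helpful voice assistant. The user's instruction is spoken "
    "inside the audio clip; follow the spoken instruction directly."
)

conversation = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": [
        # Audio-only user turn; the URL is reused from the example above.
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
]

# `conversation` is then fed to processor.apply_chat_template(...) and the
# rest of the pipeline (audio loading, processor, generate) is unchanged.
```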

If you achieve positive results, please let us know! 😊
