Due diligence
I have done my due diligence in trying to find the answer myself.

Topic
The PyTorch implementation

Question
moshi always outputs unexpected answers, e.g. "Hello, what's going on?"

Here is my script:
```python
from huggingface_hub import hf_hub_download
import torch
import os
import librosa
import numpy as np
from tqdm import tqdm
from moshi.moshi.models import loaders, LMGen
import soundfile as sf
from subprocess import call
import sphn
import sentencepiece

device = torch.device('cpu')
MODEL_PATH = '/opt/ailab_mnt1/LLM_MODELS/moshi/moshika-pytorch-bf16'
mimi_weight = os.path.join(MODEL_PATH, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device='cpu')
mimi.set_num_codebooks(8)  # up to 32 for mimi, but limited to 8 for moshi.
text_tokenizer = sentencepiece.SentencePieceProcessor(
    os.path.join(MODEL_PATH, loaders.TEXT_TOKENIZER_NAME))
moshi_weight = os.path.join(MODEL_PATH, loaders.MOSHI_NAME)
moshi = loaders.get_moshi_lm(moshi_weight, device=device)
lm_gen = LMGen(moshi, temp=0.8, temp_text=0.7)  # this handles sampling params etc.


def save_as_wav(y, sr, output_path):
    sf.write(output_path, y, sr)


def one_round_test(audio_path):
    wav, sample_sr = sphn.read(audio_path)
    sample_rate = mimi.sample_rate
    wav = sphn.resample(
        wav, src_sample_rate=sample_sr, dst_sample_rate=sample_rate
    )
    wav = torch.from_numpy(wav[None, :])
    mimi.to(device)
    # wave, sample_rate = torch.randn(1, 1, 24000 * 10), mimi.sample_rate
    with torch.no_grad():
        codes = mimi.encode(wav.to(device))  # [B, K = 8, T]
        # decoded = mimi.decode(codes)
        # save_as_mp3(decoded.numpy().squeeze(), sample_rate, audio_path.replace('.mp3', '_decoded.wav'))
        # Supports streaming too.
        frame_size = int(mimi.sample_rate / mimi.frame_rate)
        all_codes = []
        # with mimi.streaming(batch_size=1):
        for offset in tqdm(range(0, wav.shape[-1], frame_size), desc='mimi encoding...'):
            frame = wav[:, :, offset: offset + frame_size]
            if frame.shape[-1] < frame_size:
                continue
            codes = mimi.encode(frame.to(device))
            assert codes.shape[-1] == 1, codes.shape
            all_codes.append(codes)

    # mimi.cuda()
    # moshi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MOSHI_NAME)
    out_wav_chunks = []
    main_text = []
    # Now we will stream over both Moshi I/O, and decode on the fly with Mimi.
    with torch.no_grad(), lm_gen.streaming(1), mimi.streaming(1):
        for idx, code in tqdm(enumerate(all_codes), desc='lm_gen stepping...', total=len(all_codes)):
            tokens_out = lm_gen.step(code.to(device))
            # tokens_out is [B, 1 + 8, 1], with tokens_out[:, 0] holding the text token.
            if tokens_out is not None:
                wav_chunk = mimi.decode(tokens_out[:, 1:])
                out_wav_chunks.append(wav_chunk)
                text_token = tokens_out[:, 0, 0][0].item()
                if text_token not in (0, 3):  # skip special/padding text tokens
                    _text = text_tokenizer.id_to_piece(text_token)
                    _text = _text.replace("▁", " ")
                    main_text.append(_text)
            # print(idx, end='\r')
    out_wav = torch.cat(out_wav_chunks, dim=-1)
    save_as_wav(out_wav.squeeze().cpu().numpy(), sample_rate,
                audio_path.replace('.wav', '_answer.wav'))
    print('generated_text:')
    print(''.join(main_text))


if __name__ == '__main__':
    wave_root = 'wave_data/tts_res'
    wave_files = os.listdir(wave_root)
    for file in wave_files:
        audio_path = os.path.join(wave_root, file)
        one_round_test(audio_path)
```
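As a side note, the commented-out decode lines above suggest a useful sanity check: round-trip the input through Mimi alone to confirm the codec path is fine before looking at the LM. A minimal sketch, reusing the `wav` and `audio_path` variables from inside `one_round_test`:

```python
# Sanity check (sketch): reconstruct the input audio from its Mimi codes only.
# Assumes the wav / audio_path variables from inside one_round_test above.
with torch.no_grad():
    codes = mimi.encode(wav.to(device))   # [B, K = 8, T]
    decoded = mimi.decode(codes)          # reconstructed audio, [B, 1, T']
save_as_wav(decoded.squeeze().cpu().numpy(), mimi.sample_rate,
            audio_path.replace('.wav', '_roundtrip.wav'))
```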
My audio files were generated with TTS from the following questions (a sketch of one possible way to produce such clips follows the list):
```python
questions = [
    'At what temperature does water boil?',
    'What is the largest organ in the human body?',
    'Which is the largest planet in the solar system?',
    'What is the approximate speed of light?',
    'Who discovered the double helix structure of DNA?',
    'What is the deepest ocean trench on Earth?',
    'What is the normal human body temperature?',
    'What is the first element in the periodic table?',
    'In which year did humans first land on the moon?',
    'What is the approximate total length of the Great Wall?',
    'What is the highest mountain on Earth?',
    'How many years ago did dinosaurs go extinct?',
    'What is the smallest bone in the human body?',
    'What is the longest river in the world?',
    'How many times does the human heart beat per minute on average?',
]
```
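The original post does not include the TTS code, so the following is a hypothetical reconstruction; gTTS, the file names, and the ffmpeg conversion are assumptions:

```python
# Hypothetical TTS step (not from the thread): synthesize each question to mp3,
# then convert to 24 kHz wav for the script above, e.g.:
#   ffmpeg -i q00.mp3 -ar 24000 q00.wav
import os
from gtts import gTTS

out_dir = 'wave_data/tts_res'  # matches wave_root in the script above
os.makedirs(out_dir, exist_ok=True)
for i, question in enumerate(questions):
    gTTS(question).save(os.path.join(out_dir, f'q{i:02d}.mp3'))
```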
When I run moshi with the script above, I get this result:

Note: I ran the Python script in CPU mode, but GPU mode was also tested in online mode and gave very similar wav output.
The released weights for moshiko/moshika are trained as a voice assistant, and as such inference usually starts with the model greeting the user with something like "hello, what's going on?", so I guess that's the expected behavior for these weights.
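One possible workaround, sketched here rather than taken from the thread (the two-second warm-up length is an assumption), is to step the generator on encoded silence first, letting the greeting play out before the question frames are fed in:

```python
# Sketch: drain the assistant's opening greeting on silence before the question.
# Run inside the same streaming context, ahead of the main loop over all_codes.
frame_size = int(mimi.sample_rate / mimi.frame_rate)
silence = torch.zeros(1, 1, frame_size, device=device)
greeting_chunks = []
with torch.no_grad(), lm_gen.streaming(1), mimi.streaming(1):
    for _ in range(int(2.0 * mimi.frame_rate)):  # ~2 s of silence (assumed)
        codes = mimi.encode(silence)             # [B, K, 1]
        tokens_out = lm_gen.step(codes)
        if tokens_out is not None:
            greeting_chunks.append(mimi.decode(tokens_out[:, 1:]))
    # ...then feed the encoded question frames exactly as in the script above.
```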