Model outputs � in Korean/Chinese #284

Closed · bqhuyy opened this issue on Aug 2, 2024 · 10 comments


bqhuyy commented Aug 2, 2024

Issue description

Model outputs � in Korean/Chinese

Expected Behavior

The model outputs the correct Unicode (UTF-8) characters.

Actual Behavior

The model outputs � (U+FFFD, the Unicode replacement character).

Steps to reproduce

This problem occurs when working with Chinese/Korean text. I'm using Llama 3.1 (Q4_K_M); it also occurs with Qwen2 models.

const a1 = await session.prompt(q1, {
    onTextChunk(chunk) {
        // `chunk` sometimes contains � instead of the expected character
        process.stdout.write(chunk);
    }
});

My Environment

Dependency              Version
Operating System        Windows 10
CPU                     AMD Ryzen 7 3700X
Node.js version         v20.11.1
Typescript version      5.5.2
node-llama-cpp version  3.0.0-beta.40

Additional Context

I've tried both the onToken and onTextChunk functions; both return the same result. I found some related issues: ggml-org/llama.cpp#11, ggml-org/llama.cpp#79

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

bqhuyy added the bug and requires triage labels on Aug 2, 2024

giladgd commented Aug 2, 2024

Can you please provide me with reproduction code?
Also, it seems that bartowski's model has some issues; I recommend using mradermacher's model instead.


bqhuyy commented Aug 5, 2024

@giladgd this is my code:

import {getLlama, LlamaChatSession} from "node-llama-cpp";

const modelPath = "path/to/model";
const msg = "긴 동화를 들려주세요";

const llama = await getLlama();
const model = await llama.loadModel({modelPath});
const context = await model.createContext({contextSize: 8192});

const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    systemPrompt: ""
});

await session.prompt(msg, {
    onTextChunk: (chunk: string) => {
        console.log(chunk);
    }
});

I've tried mradermacher's models; they still output �. When I run llama.cpp directly, the output does not seem to contain �.
I've also converted this model myself; it still hits the same � problem when run with node-llama-cpp.


giladgd commented Aug 6, 2024

@bqhuyy Are you sure it's not due to terminal encoding?
Can you try using the example Electron app? It should be more reliable against encoding issues.

I've prompted mradermacher's model with 你好呀! ("Hello there!") and it responded with 你好！我是你的助手，很高兴能和你交流。有什么问题或需要帮助的地方吗？ ("Hello! I'm your assistant, glad to chat with you. Do you have any questions or anything I can help with?")
Do you have a specific prompt you use that reproduces it that I can try?
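
One way to rule out terminal encoding (an illustrative diagnostic, not code from this thread; the function name and file path are made up) is to check whether U+FFFD is already present in the chunk string itself and to mirror the stream into a UTF-8 file that a UTF-8-aware editor can inspect:

import {appendFileSync} from "fs";

// Hypothetical diagnostic: if U+FFFD (0xFFFD) appears inside the chunk
// string itself, the replacement character was produced before the
// terminal ever rendered anything.
function inspectChunk(chunk: string) {
    const codePoints = [...chunk].map((ch) => ch.codePointAt(0)!);
    if (codePoints.includes(0xfffd))
        console.error("U+FFFD found in chunk:", JSON.stringify(chunk));

    // Mirror the stream into a file and open it in a UTF-8-aware editor
    // to compare against what the terminal showed.
    appendFileSync("stream-output.txt", chunk, "utf8");
}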


bqhuyy commented Aug 7, 2024

@giladgd I found the issue. It seems the problem occurs in the streamed chunks (tokens), not in the final output.

  • input:
긴 동화를 들려주세요 ("Please tell me a long fairy tale")
  • output (a Korean response introducing the tales of Snow White, Donggeurang-ttaeng, and Hong Gildong; the final text renders correctly, with no �):
동화는 다양한 종류가 있지만, 가장 유명한 동화를 몇 가지 소개해 드리겠습니다.

백설공주: 백설공주는 한 여왕의 딸로, 아름답고 착한 성품을 가지고 있습니다. 그녀는 한 번에 세 명의 매화와 함께 산으로 올라간 후, 그곳에서 매화와 사랑에 빠지게 됩니다.
동그랑땡: 동그랑땡은 작은 소년으로, 그의 이름은 그의 모양과 관련이 있습니다. 그는 자신의 집을 지키기 위해 다양한 방법을 사용합니다.
홍길동: 홍길동은 조선 시대에 살았던 유명한 여인간의 아들로, 그의 이름은 그의 용맹함과 기지로 유명합니다. 그는 여러 가지 모험을 하며 자신의 꿈을 실현하려고 합니다.
이러한 동화는 모두 다양한 문화와 전통에서 유래되었으며, 많은 사람들에게 사랑받고 있습니다. 동화를 들으면서도, 우리의 삶에 대한 교훈과 가르침을 얻을 수 있습니다.

혹시 특정 동화를 듣고 싶으신가요?
(video attachment: IMG_4139.mp4)
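
This behavior is consistent with a multi-byte UTF-8 character being split across two tokens, with each piece decoded independently. A minimal sketch of that failure mode, and of the streaming-decoder behavior that avoids it (illustrative only, not node-llama-cpp internals):

// "긴" encodes to three UTF-8 bytes: EA B8 B4.
const bytes = Buffer.from("긴", "utf8");
const part1 = bytes.subarray(0, 2); // bytes from a hypothetical first token
const part2 = bytes.subarray(2);    // bytes from a hypothetical second token

// Decoding each part on its own turns the incomplete sequences into U+FFFD:
console.log(part1.toString("utf8") + part2.toString("utf8")); // "��"

// A stateful streaming decoder buffers the incomplete tail instead:
const decoder = new TextDecoder("utf-8");
const text = decoder.decode(part1, {stream: true}) + decoder.decode(part2);
console.log(text); // "긴"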


giladgd commented Aug 7, 2024

@bqhuyy Thanks for investigating this :)
I managed to reproduce it now and found the cause.
I've created a PR for this (#293) and will release a new version with the fix soon.


github-actions bot commented Aug 7, 2024

🎉 This issue has been resolved in version 3.0.0-beta.42 🎉

Your semantic-release bot 📦🚀

giladgd closed this as completed on Aug 7, 2024
giladgd self-assigned this on Aug 7, 2024
giladgd removed the requires triage label on Aug 7, 2024

bqhuyy commented Aug 8, 2024

@giladgd Hi, the problem still occurs when streaming in v3.0.0-beta.42.
(screenshot attachment: Screenshot 2024-08-08 101110)


giladgd commented Aug 9, 2024

@bqhuyy It seems the fix I released only solved streaming to the console, not streaming to other destinations (probably because the console splices split Unicode characters back together when they are printed sequentially).
I've opened another PR (#295) that fixes it everywhere.
Thanks for letting me know :)
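
The general shape of such a fix (a sketch under the assumption that raw token bytes are available; this is not the actual code from PR #295) is to run all emitted text through one stateful streaming decoder, so that a trailing incomplete UTF-8 sequence is held back until the bytes that complete it arrive:

// Hypothetical helper: emits only complete characters, carrying any
// incomplete UTF-8 tail over into the next push() call.
class StreamingTextAssembler {
    private readonly decoder = new TextDecoder("utf-8");

    public push(tokenBytes: Uint8Array): string {
        // With {stream: true}, a partial sequence stays buffered inside
        // the decoder instead of becoming U+FFFD.
        return this.decoder.decode(tokenBytes, {stream: true});
    }

    public finish(): string {
        // Flush the remainder; this yields U+FFFD only if generation
        // truly ended mid-character.
        return this.decoder.decode();
    }
}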


github-actions bot commented Aug 9, 2024

🎉 This issue has been resolved in version 3.0.0-beta.43 🎉

Your semantic-release bot 📦🚀


github-actions bot commented Sep 24, 2024

🎉 This PR is included in version 3.0.0 🎉

Your semantic-release bot 📦🚀
