Fix UTF-8 handling (including colors) #79

Merged
merged 1 commit into from
Mar 13, 2023

Contributor

@kharvd kharvd commented Mar 13, 2023

Fixes #11 (including color handling). Largely based on #73 (props to @j-f1), but doesn't pull in any additional dependencies.

Sample output from 13B model with default parameters:

main: prompt: '关于爱因斯坦的生平。他出生于'
main: number of tokens in prompt = 19
     1 -> ''
 31057 -> '关'
 30909 -> '于'
   234 -> '�'
   139 -> '�'
   180 -> '�'
 31570 -> '因'
 31824 -> '斯'
   232 -> '�'
   160 -> '�'
   169 -> '�'
 30210 -> '的'
 30486 -> '生'
 30606 -> '平'
 30267 -> '。'
 31221 -> '他'
 30544 -> '出'
 30486 -> '生'
 30909 -> '于'
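
Each '�' entry in the listing is a single-byte token: it carries one byte of a multi-byte UTF-8 character, so it cannot be displayed on its own. Below is a minimal sketch of how three such tokens reassemble into one character. It assumes the vocabulary places its 256 byte tokens at IDs 3 to 258 (an assumption inferred from the IDs in this listing, not something this PR states); `BYTE_TOKEN_OFFSET` and `byte_tokens_to_text` are hypothetical names used only for illustration.

```python
# Assumption: byte-fallback tokens occupy IDs 3..258, so byte = token_id - 3.
BYTE_TOKEN_OFFSET = 3


def byte_tokens_to_text(token_ids):
    """Join single-byte tokens into raw bytes, then decode them as UTF-8."""
    raw = bytes(token_id - BYTE_TOKEN_OFFSET for token_id in token_ids)
    return raw.decode("utf-8")


# Token IDs taken from the listing above.
print(byte_tokens_to_text([234, 139, 180]))  # 爱 (bytes E7 88 B1)
print(byte_tokens_to_text([232, 160, 169]))  # 坦 (bytes E5 9D A6)
```

Under that assumption, the two runs of '�' tokens account for exactly the two prompt characters (爱 and 坦) that have no dedicated token in the listing.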

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


关于爱因斯坦的生平。他出生于1856年,就是一位德国化学家、天文学家和温谐器研究者。20世紀最初时期在高飞航母中被发现,爱因斯坦对此使用
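
Since a character can arrive as several separate byte tokens, whatever displays the generated text cannot treat each token as a complete string. The sketch below is illustrative only and is not the code in this PR: it uses Python's standard incremental UTF-8 decoder to hold back incomplete byte sequences until the character is complete. `stream_text` is a hypothetical helper name.

```python
import codecs


def stream_text(byte_chunks):
    """Yield printable text, holding back incomplete UTF-8 sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in byte_chunks:
        text = decoder.decode(chunk)  # returns "" while a character is incomplete
        if text:
            yield text
    # Flush whatever is still buffered once the stream ends.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Three single-byte chunks followed by one chunk holding a whole character.
chunks = [b"\xe7", b"\x88", b"\xb1", "因".encode("utf-8")]
print(list(stream_text(chunks)))  # ['爱', '因']
```

Writing the raw bytes to a UTF-8 terminal in order gives the same visible result, as long as nothing else (such as a color escape sequence) is written between the bytes of one character.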

Contributor

@beiller beiller left a comment

In my opinion, for what that's worth, this is the cleanest and simplest fix for the UTF-8 issue.

Thank you, @j-f1, for the initial idea behind this approach in #73.

FYI @ggerganov, we have 3 PRs for this same issue.

Correction: 4 PRs :)

Member

@ggerganov ggerganov left a comment

Thank you for this!
It would've taken me so much time to figure this out. Great teamwork 🦙

ai2p commented Apr 11, 2023

It looks like on Windows it still can't handle non-English input prompts that contain some non-Latin characters, even after changing the CMD code page to UTF-8 with the command 'chcp 65001'.

Successfully merging this pull request may close these issues.

Unicode support