Showing 1 changed file with 25 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,26 @@ | ||
## Plan for the Luotuo (骆驼) Tokenizer | ||
# Plan for the Luotuo (骆驼) Tokenizer | ||
|
||
## Motivation | ||
|
||
+ The original LLaMA has poor Chinese support | ||
|
||
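+ A tokenizer with Chinese tokens merged in is not aligned with the original embedding space, and retraining it scrambles much of the knowledge the model already holds | ||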
|
||
+ We want a tokenizer with better Chinese support that is still aligned with the original LLaMA | ||
|
||
## Goals | ||
|
||
We want our tokenizer to have the following properties | ||
|
||
+ First, embedding-space alignment: for example, the character "铁" should align with the English word "iron" in LLaMA's original embedding space | ||
|
||
+ It can use the four-corner code (四角编码) to generalize to arbitrary Chinese characters | ||
|
||
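+ High-frequency Chinese characters get their own dedicated tokens, and each one's vector in LLaMA is determined from some of the high-frequency words it appears in | ||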
|
||
+ | ||
|
||
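
The second and third goals can be read as a two-tier scheme: frequent characters keep a dedicated token, and every other character falls back to its four-corner code. The sketch below illustrates that reading only. The function names, the `<fc>`, `<fc_N>` and `<unk>` token names, and both lookup arguments are hypothetical and not part of the plan; the caller supplies the four-corner table, since no table is given here.

```python
def tokenize_char(ch, high_freq, four_corner):
    """Return the token list for one Chinese character.

    high_freq:   set of characters that own a dedicated token (assumed input).
    four_corner: dict mapping a character to its four-corner code as a
                 digit string (assumed input, supplied by the caller).
    """
    if ch in high_freq:
        # Frequent characters map to a single dedicated token.
        return [ch]
    code = four_corner.get(ch)
    if code is None:
        # No code available: fall back to a hypothetical unknown token.
        return ["<unk>"]
    # One token per digit, wrapped in markers, so any coded character
    # is representable with a small fixed vocabulary.
    return ["<fc>"] + [f"<fc_{digit}>" for digit in code] + ["</fc>"]


def tokenize(text, high_freq, four_corner):
    """Tokenize a string character by character with the scheme above."""
    tokens = []
    for ch in text:
        tokens.extend(tokenize_char(ch, high_freq, four_corner))
    return tokens
```

With this layout the vocabulary grows only by the number of high-frequency characters plus ten digit tokens and two markers, at the cost of longer sequences for rare characters.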
## Evaluation method | ||
|
||
|
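
One possible check of the alignment goal, offered as an assumption and not as the method this plan settles on: build a candidate vector for a new character token and compare it with existing vectors by cosine similarity. In the sketch below the candidate for "铁" is the mean of the vectors of assumed anchor words, which is one reading of the third goal. The 2-d vectors are made-up toy values standing in for LLaMA embeddings, and the helper names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy 2-d vectors standing in for LLaMA input embeddings (made-up values).
toy_embeddings = {
    "iron": [1.0, 0.0],
    "steel": [0.8, 0.6],
    "water": [0.0, 1.0],
}

# Candidate vector for the new token "铁", averaged from assumed anchor words.
tie_vector = mean_vector([toy_embeddings["iron"], toy_embeddings["steel"]])

aligned = cosine(tie_vector, toy_embeddings["iron"])
unrelated = cosine(tie_vector, toy_embeddings["water"])
print(f"iron: {aligned:.3f}, water: {unrelated:.3f}")
```

A token would count as aligned under this check when its similarity to the intended English word is clearly higher than its similarity to unrelated words.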
||
## |