Is there something wrong with this dataset? merge.py errors out as soon as I run it #58
Running merge.py fails with the following error:
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 112 column 1 (char 11779)
Comments
This is probably because one of the team members accidentally added a trailing comma or something similar at the end of the dataset while annotating. We'll check it shortly.
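For illustration only (this snippet is not part of the repository), a single stray trailing comma is enough to reproduce the kind of JSONDecodeError reported above:

```python
import json

good = '[{"instruction": "hi", "output": "hello"}]'
bad = '[{"instruction": "hi", "output": "hello"},]'  # trailing comma after the last object

json.loads(good)  # parses fine
json.loads(bad)   # raises json.decoder.JSONDecodeError: Expecting value ...
```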
Great, thank you very much for the reply. One more question: when I run the model from the success case with the English dataset, I hit the following error. Is there a problem with this code?
Traceback (most recent call last):
File "/home/cike/zzp/alpaca/chatglm_finetuning/train.py", line 121, in <module>
tokenizer, config, _,_ = dataHelper.load_tokenizer_and_config(tokenizer_class_name=ChatGLMTokenizer,config_class_name=ChatGLMConfig)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/deep_training/data_helper/data_helper.py", line 257, in load_tokenizer_and_config
tokenizer = load_tokenizer(tokenizer_name=tokenizer_name or model_args.tokenizer_name,
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/deep_training/data_helper/data_module.py", line 29, in load_tokenizer
tokenizer = class_name.from_pretrained(tokenizer_name, **tokenizer_kwargs)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1804, in from_pretrained
return cls._from_pretrained(
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1958, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 211, in __init__
self.sp_tokenizer = SPTokenizer(vocab_file)
File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 32, in __init__
self.text_tokenizer = self._build_text_tokenizer(encode_special_tokens=False)
File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 65, in _build_text_tokenizer
self._configure_tokenizer(
File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 61, in _configure_tokenizer
text_tokenizer.refresh()
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/icetk/text_tokenizer.py", line 31, in refresh
self.sp.Load(model_proto=self.proto.SerializeToString())
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/sentencepiece/__init__.py", line 904, in Load
return self.LoadFromSerializedProto(model_proto)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/sentencepiece/__init__.py", line 250, in LoadFromSerializedProto
return _sentencepiece.SentencePieceProcessor_LoadFromSerializedProto(self, serialized)
RuntimeError: Internal: [MASK] is already defined.
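A minimal way to narrow this down (a sketch; the checkpoint path is a placeholder, not from the repository): the traceback fails inside load_tokenizer_and_config, before any training data is read, so loading the tokenizer on its own should reproduce it without the English dataset:

```python
# Hypothetical isolation test: no dataset is involved here.
from tokenization_chatglm import ChatGLMTokenizer

tokenizer = ChatGLMTokenizer.from_pretrained("/path/to/chatglm-6b")  # placeholder path
print(tokenizer.tokenize("hello world"))
```

If this alone raises the same "Internal: [MASK] is already defined" error, the problem lies in the tokenizer/icetk environment (possibly an icetk or sentencepiece version mismatch) rather than in the data files.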
Hi, that's quite possible. Since this dataset is typed by hand, punctuation mistakes occasionally slip in and break the merge.
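One way to catch this before merging (a sketch; the glob pattern and directory are assumptions, adjust them to the actual dataset layout): parse each JSON file on its own and report the first position that fails:

```python
import glob
import json

# Validate every dataset file individually so the offending file/line is reported
# before merge.py is run.
for path in sorted(glob.glob("*.json")):
    try:
        with open(path, encoding="utf-8") as fp:
            json.load(fp)
    except json.JSONDecodeError as exc:
        print(f"{path}: line {exc.lineno}, column {exc.colno}: {exc.msg}")
```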
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/init.py", line 293, in load
return loads(fp.read(),
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/init.py", line 346, in loads
return _default_decoder.decode(s)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 112 column 1 (char 11779)