Thanks for the great lecture and implementation! As always, it was a pleasure.
I have tried to implement LlamaTokenizer (without using the sentencepiece backend) while staying as close to the minbpe implementation as possible. Essentially, it involves doing BPE on Unicode code points, adding a UTF-8 byte fallback, and using character coverage to handle rare characters during training. The implementation is available here. I haven't made a pull request because it's still not EXACTLY the same as LlamaTokenizer, but I am hoping people can use it as a starting point.
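To illustrate the idea, here is a minimal, hypothetical sketch (not the linked code) of the character-coverage and byte-fallback parts; the BPE merges on top of this base vocabulary, as in minbpe, are omitted for brevity. All names (`CHAR_COVERAGE`, `base_vocab`, etc.) are assumptions for illustration only.

```python
# Hedged sketch: base vocabulary via character coverage + UTF-8 byte fallback.
# Ids 0..255 are reserved for raw bytes; covered characters start at 256.
from collections import Counter

CHAR_COVERAGE = 0.9995  # fraction of corpus characters kept as base tokens (assumed value)

def base_vocab(text):
    """Keep the most frequent characters until coverage is reached; the rest will byte-fallback."""
    counts = Counter(text)
    total = sum(counts.values())
    kept, covered = [], 0
    for ch, n in counts.most_common():
        if total and covered / total >= CHAR_COVERAGE:
            break
        kept.append(ch)
        covered += n
    return {ch: 256 + i for i, ch in enumerate(kept)}

def encode_with_byte_fallback(text, char_to_id):
    """Encode a string; characters outside the base vocab fall back to their UTF-8 bytes (ids 0..255)."""
    ids = []
    for ch in text:
        if ch in char_to_id:
            ids.append(char_to_id[ch])
        else:
            ids.extend(ch.encode("utf-8"))  # one id per byte
    return ids

if __name__ == "__main__":
    corpus = "hello world, こんにちは"
    char_to_id = base_vocab(corpus)
    # "🤖" is not in the base vocab, so it encodes as its four UTF-8 byte ids.
    print(encode_with_byte_fallback("hello 🤖", char_to_id))
```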
Please refer to the README.md (point 6) for details on the new functionality and caveats/TODOs.
Best
Haris