Implementation of LlamaTokenizer (without sentencepiece) #60

MaveriQ · 2024-03-26T10:39:10Z

Thanks for the great lecture and implementation! As always, it was a pleasure.

I have tried to implement LlamaTokenizer (without the sentencepiece backend) while staying as close to the minbpe implementation as possible. Essentially, it involves doing BPE on Unicode code points, with UTF-8 byte fallback, and using character coverage to handle rare characters during training. The implementation is available here. I haven't made a pull request because it's still not EXACTLY the same as LlamaTokenizer, but I am hoping people can use it as a starting point.
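To illustrate the idea, here is a minimal sketch of how character coverage and byte fallback can fit together before any BPE merges happen. It is not the code from the linked implementation: the function names `covered_chars` and `to_symbols` and the `coverage` value are assumptions made for this example, and the `<0xNN>` spelling follows the byte-token convention seen in sentencepiece-style vocabularies.

```python
from collections import Counter


def covered_chars(text, coverage):
    """Keep the most frequent characters until they account for
    at least `coverage` of all character occurrences (hypothetical helper)."""
    counts = Counter(text)
    total = sum(counts.values())
    kept, running = set(), 0
    for ch, n in counts.most_common():
        if running >= coverage * total:
            break  # remaining characters are rare and stay uncovered
        kept.add(ch)
        running += n
    return kept


def to_symbols(text, kept):
    """Turn text into initial symbols: covered characters stay as-is,
    everything else falls back to one token per UTF-8 byte."""
    symbols = []
    for ch in text:
        if ch in kept:
            symbols.append(ch)
        else:
            symbols.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return symbols


# Toy corpus: 'a' and 'b' are frequent, 'é' is rare.
text = "aaaaaaaabb" + "é"
kept = covered_chars(text, coverage=0.9)
symbols = to_symbols(text, kept)
print(sorted(kept))
print(symbols[-3:])
```

With this toy input, `é` falls outside the coverage threshold and is emitted as two byte tokens, while `a` and `b` remain character-level symbols that BPE merges could then operate on.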

Please refer to the README.md (point 6) for details on the new functionality and caveats/TODOs.

Best
Haris
