creating custom tokenizer models in python #1740

Open
ctruexcytiva opened this issue Feb 24, 2025 · 1 comment

Comments

@ctruexcytiva

Hello,

I would like to create a custom tokenizer model using the drain3 library or a similar method. Looking at the example in bindings/python/examples/custom_components.py, I see that it isn't hard to create a custom pre-tokenizer or normalizer, but I couldn't follow the same pattern of using the custom method for a model. Would I just need to implement a class such as the following and initialize the tokenizer with Tokenizer(MyCustomModelClass())?

import tokenizers
from tokenizers.trainers import Trainer


class Drain3TokenizationModel:
    # Placeholder bodies only; each method mirrors the model interface.
    def __new__(cls, *args, **kwargs):
        # Calling Drain3TokenizationModel() here would recurse forever.
        return super().__new__(cls)

    def __init__(self):
        pass

    def __getstate__(self, /):
        pass

    def __setstate__(self, /, state):
        pass

    def get_trainer(self):
        return Trainer()

    def id_to_token(self, id):
        return "token"

    def save(self, folder, prefix):
        return []

    def token_to_id(self, tokens):
        return 1

    def tokenize(self, sequence):
        # One dummy token per character, with (start, end) offsets.
        return [tokenizers.Token(1, "token", (i, i + 1)) for i, _ in enumerate(sequence)]
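
To make the method contract above concrete, here is a minimal plain-Python sketch with real bodies instead of placeholders. It is only an illustration under assumptions: `WhitespaceVocabModel` and the `Token` namedtuple are hypothetical names, `Token` is a stand-in for `tokenizers.Token`, and nothing here shows that `Tokenizer(...)` accepts such an object, which is the open question in this issue.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for tokenizers.Token: (id, value, offsets).
Token = namedtuple("Token", ["id", "value", "offsets"])


class WhitespaceVocabModel:
    """Plain-Python illustration of the model methods; not a tokenizers model."""

    def __init__(self, vocab, unk_token="[UNK]"):
        self.unk_token = unk_token
        self.vocab = dict(vocab)
        if unk_token not in self.vocab:
            # Give the unknown token the next free id.
            self.vocab[unk_token] = max(self.vocab.values(), default=-1) + 1
        self.inverse = {i: t for t, i in self.vocab.items()}

    def token_to_id(self, token):
        # Returns None for tokens outside the vocabulary.
        return self.vocab.get(token)

    def id_to_token(self, id):
        # Returns None for ids outside the vocabulary.
        return self.inverse.get(id)

    def tokenize(self, sequence):
        # Split on whitespace and keep character offsets into the input.
        tokens = []
        for match in re.finditer(r"\S+", sequence):
            word = match.group()
            value = word if word in self.vocab else self.unk_token
            tokens.append(Token(self.vocab[value], value, match.span()))
        return tokens


model = WhitespaceVocabModel({"user": 0, "logged": 1, "in": 2})
for token in model.tokenize("user logged out"):
    print(token.id, token.value, token.offsets)
```

A drain3-based variant would presumably replace the whitespace split with template mining, but that wiring is not shown here.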
@ctruexcytiva
Author

Just to add to this: are the tokenizer models implemented only in Rust, so that adding a new one would require a Rust implementation? Would it be better to try to subclass the TokenizerBase class in Python?
