Add Text Tokenizer #380
-
Feature Request: Add a way to tokenize text so that it can be passed as an input (like logit_bias) for models.
Is your feature request related to a problem? Please describe.
I am trying to use OpenAI APIs like completion. There is an option to pass "logit_bias", but currently there is no way to generate the proper tokens for a text in order to pass them in. 
Describe the solution you'd like
A .NET implementation of OpenAI's tokenizer.
Describe alternatives you've considered
There is an existing MIT-licensed NuGet package called GPT-3-Encoder-Sharp that does it.
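To make the request concrete: once text can be tokenized, the resulting token IDs are exactly what the logit_bias field expects as keys. A minimal Python sketch of building such a request body follows; the token IDs (464, 3290) and the model name are placeholders for illustration, not real encodings.

```python
# Hypothetical sketch: token IDs produced by a tokenizer become the keys
# of the logit_bias map in a completion request payload.
# The IDs below are made up for illustration.
import json

def build_completion_payload(prompt, biased_token_ids, bias=-100):
    """Build a request body that suppresses the given token IDs."""
    return {
        "model": "text-davinci-003",
        "prompt": prompt,
        # logit_bias maps token-ID strings to a bias value in [-100, 100];
        # -100 effectively bans the token from being sampled.
        "logit_bias": {str(tid): bias for tid in biased_token_ids},
    }

payload = build_completion_payload("Hello", [464, 3290])
print(json.dumps(payload["logit_bias"]))
```

Without a tokenizer in the package, there is no way to compute those ID keys from user text, which is the gap this feature request describes.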
Replies: 20 comments
-
Is there anything preventing you from using GPT-3-Encoder-Sharp with this package?
-
Honestly, I would rather OpenAI add an endpoint specifically to do this. They have their own tokenizer utility page that gives you an idea of how many tokens a text uses, but the encoder is different per model. I may not pick up this issue, only because it's a moving target and there are other NuGet packages that can handle this task.
-
Even OpenAI is recommending a third-party package called gpt-3-encoder.
We don't have to worry about the encoder logic changing or evolving, because OpenAI released the original encoder four years ago and there have been no changes to it since. The encoding logic of GPT-2 and GPT-3 is the same.
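For context, the byte-pair-encoding idea behind that encoder can be sketched in a few lines of Python: repeatedly merge the adjacent pair with the best learned merge rank until no learned merge applies. The merge table below is a toy example, not the real GPT-2 vocabulary.

```python
# Illustrative BPE sketch, not the actual GPT-2/GPT-3 encoder.
# MERGES maps an adjacent symbol pair to its merge rank (lower = earlier).
MERGES = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}

def bpe(word):
    parts = list(word)
    while len(parts) > 1:
        # Find the adjacent pair with the lowest (best) merge rank.
        pairs = [(MERGES.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(parts, parts[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break  # no learned merge applies to any remaining pair
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]
    return parts

print(bpe("lower"))  # -> ['low', 'er'] under this toy merge table
```

Because the merge table for a given model is fixed at release, an implementation only needs to ship that table once, which is the point being made here.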
-
But the encoder for GPT-4 is different.
-
I'm not sure about GPT-4 (but I don't think so); my point is that it won't change or evolve. If GPT-4 has a different encoding, we can write one-time encoding logic for GPT-4, and it will never change.
-
That's not what I heard.
-
In either case, like I said, I won't be picking up this task, but PRs are always welcome.
-
Sure, I will do it over the weekend.
-
I still don't understand why the package you referenced before isn't a sufficient substitute?
-
Just want one OpenAI package to do everything related to OpenAI, that's all. It's up to you, feel free to close the issue. 🙂
-
I'll leave it open if you plan to open a PR; I was just curious more than anything.
-
https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml.tokenizers?view=ml-dotnet-preview |
-
https://github.com/aiqinxuancai/TiktokenSharp Here's another good reference. I like that they're also pulling tiktoken.
-
@StephenHodgson I referred to OpenAI's implementation; they also pull the tokens from the blob. In their code, I found one interesting comment: "# TODO: these will likely be replaced by an API endpoint". Now my question is: are you still open to having our own custom implementation, or would you rather wait for the API endpoint?
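As a rough illustration of what "pulling the tokens from the blob" involves: the published .tiktoken vocabulary files appear to be one base64-encoded token followed by its integer rank per line. A hedged Python sketch of parsing that format follows; the sample lines and ranks here are fabricated for illustration, not taken from a real vocabulary file.

```python
# Hedged sketch of parsing a tiktoken-style BPE vocabulary blob.
# Assumed format: "<base64 token bytes> <rank>" per line; the sample
# data below is fabricated, not a real vocabulary excerpt.
import base64

SAMPLE = """IQ== 0
aGVsbG8= 31373
"""

def parse_tiktoken_vocab(text):
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        # Keys are the raw token bytes; values are their BPE ranks.
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks

vocab = parse_tiktoken_vocab(SAMPLE)
print(vocab[b"hello"])  # 31373
```

If OpenAI does ship a tokenizer API endpoint, a local parser like this could be swapped out without changing the public tokenize/encode surface of the package.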
-
Nice, looks like they took my suggestion seriously.
-
I guess it doesn't hurt to do it, and then replace it when the API becomes available.
-
@logankilpatrick any internal support on adding an API for the tokenizer?
-
I think it's best to use either the Microsoft version or the slightly faster Tiktoken for the time being if optimization is needed.
-
I agree, I think the msft package should be easily integrable. I may consider adding it as a dependency.
-
I recommend using SharpToken because it is the fastest with the lowest memory consumption, thanks to my latest PR to that repository.
Benchmark Code
Benchmark results: