Add Text Tokenizer #380
-
Feature Request: Add a way to tokenize text so that it can be passed as an input (like logit_bias) for models.
Is your feature request related to a problem? Please describe.
I am trying to use OpenAI APIs like completion. There is an option to pass "logit_bias", but currently there is no way to generate the proper tokens for a text in order to pass them in. 
Describe the solution you'd like
A .NET implementation of OpenAI's tokenizer.
Describe alternatives you've considered
There is an existing MIT-licensed NuGet package called GPT-3-Encoder-Sharp that does it.
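To make the request concrete: once text can be tokenized, the resulting token IDs are exactly what the logit_bias field expects as keys. A minimal Python sketch of building such a request body follows; the token IDs (464, 3290) and the model name are placeholders for illustration, not real encodings.

```python
# Hypothetical sketch: token IDs produced by a tokenizer become the keys
# of the logit_bias map in a completion request payload.
# The IDs below are made up for illustration.
import json

def build_completion_payload(prompt, biased_token_ids, bias=-100):
    """Build a request body that suppresses the given token IDs."""
    return {
        "model": "text-davinci-003",
        "prompt": prompt,
        # logit_bias maps token-ID strings to a bias value in [-100, 100];
        # -100 effectively bans the token from being sampled.
        "logit_bias": {str(tid): bias for tid in biased_token_ids},
    }

payload = build_completion_payload("Hello", [464, 3290])
print(json.dumps(payload["logit_bias"]))
```

Without a tokenizer in the package, there is no way to compute those ID keys from user text, which is the gap this feature request describes.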
Replies: 20 comments
-
Is there anything preventing you from using GPT-3-Encoder-Sharp with this package?
-
Honestly, I would rather OpenAI add an endpoint specifically to do this. They have their own tokenizer utility page that gives you an idea of how many tokens a text uses, but the encoder is different per model. I may not pick up this issue, only because it's a moving target and there are other NuGet packages that can handle this task.
-
Even OpenAI is recommending a third-party package called gpt-3-encoder.
We don't have to worry about the encoder logic changing or evolving, because OpenAI released the original encoder four years ago and there have been no changes to it since. The encoding logic of GPT-2 and GPT-3 is the same.
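For context, the byte-pair-encoding idea behind that encoder can be sketched in a few lines of Python: repeatedly merge the adjacent pair with the best learned merge rank until no learned merge applies. The merge table below is a toy example, not the real GPT-2 vocabulary.

```python
# Illustrative BPE sketch, not the actual GPT-2/GPT-3 encoder.
# MERGES maps an adjacent symbol pair to its merge rank (lower = earlier).
MERGES = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}

def bpe(word):
    parts = list(word)
    while len(parts) > 1:
        # Find the adjacent pair with the lowest (best) merge rank.
        pairs = [(MERGES.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(parts, parts[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break  # no learned merge applies to any remaining pair
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]
    return parts

print(bpe("lower"))  # -> ['low', 'er'] under this toy merge table
```

Because the merge table for a given model is fixed at release, an implementation only needs to ship that table once, which is the point being made here.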
-
But the encoder for GPT-4 is different.
-
I'm not sure about GPT-4 (but I don't think so); my point is that it won't change or evolve. If GPT-4 has a different encoding, we can write one-time encoding logic for GPT-4, and it will never change.
-
That's not what I heard.
-
In either case, like I said, I won't be picking up this task, but PRs are always welcome.
-
Sure, I will do it over the weekend.
-
I still don't understand why the package you referenced before isn't a sufficient substitute?
-
Just want one OpenAI package to do everything related to OpenAI, that's all. It's up to you, feel free to close the issue. 🙂
-
I'll leave it open if you plan to open a PR; I was just curious more than anything.
-
https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml.tokenizers?view=ml-dotnet-preview |
-
https://github.com/aiqinxuancai/TiktokenSharp Here's another good reference. I like that they're also pulling tiktoken.
-
@StephenHodgson I referred to OpenAI's implementation; they also pull the tokens from the blob. In their code, I found one interesting comment: "# TODO: these will likely be replaced by an API endpoint". Now my question is: are you still open to having our own custom implementation, or would you rather wait for the API endpoint?
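As a rough illustration of what "pulling the tokens from the blob" involves: the published .tiktoken vocabulary files appear to be one base64-encoded token followed by its integer rank per line. A hedged Python sketch of parsing that format follows; the sample lines and ranks here are fabricated for illustration, not taken from a real vocabulary file.

```python
# Hedged sketch of parsing a tiktoken-style BPE vocabulary blob.
# Assumed format: "<base64 token bytes> <rank>" per line; the sample
# data below is fabricated, not a real vocabulary excerpt.
import base64

SAMPLE = """IQ== 0
aGVsbG8= 31373
"""

def parse_tiktoken_vocab(text):
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        # Keys are the raw token bytes; values are their BPE ranks.
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks

vocab = parse_tiktoken_vocab(SAMPLE)
print(vocab[b"hello"])  # 31373
```

If OpenAI does ship a tokenizer API endpoint, a local parser like this could be swapped out without changing the public tokenize/encode surface of the package.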
-
Nice, looks like they took my suggestion seriously.
-
I guess it doesn't hurt to do it, and then replace it when the API becomes available.
-
@logankilpatrick any internal support on adding an API for the tokenizer?
-
I think it's best to use either the Microsoft version or the slightly faster Tiktoken for the time being if optimization is needed.
-
I agree, I think the msft package should be easily integrable. I may consider adding it as a dependency.
-
I recommend using SharpToken because it is the fastest with the lowest memory consumption, thanks to my latest PR to that repository.
Benchmark Code
Benchmark results: