
Add Exclude Top Choice (XTC) sampler #625

Closed
wants to merge 1 commit

Conversation

Cyrus-Hei

A crude Python implementation of the XTC sampler introduced in oobabooga/text-generation-webui PR #6335.

This is the version in which EOS and newline tokens are excluded from the sampler. From my brief testing the sampler appears to be working, but it seems to slow down generation heavily when activated, and I am not sure how to fix that at the moment.
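
For reference, here is a minimal sketch of the XTC logic as described in that PR (illustrative only, written with NumPy; this is not the code in this PR, and the names xtc_threshold, xtc_probability and special_ids are just placeholders):

import numpy as np

def xtc_filter(probs, xtc_threshold=0.1, xtc_probability=0.5,
               special_ids=(), rng=np.random.default_rng()):
    # Coin flip: with probability 1 - xtc_probability the sampler does nothing.
    if rng.random() >= xtc_probability:
        return probs
    above = np.flatnonzero(probs >= xtc_threshold)
    if len(above) < 2:                       # need at least two candidates above threshold
        return probs
    keep = above[np.argmin(probs[above])]    # keep the least probable of them
    out = probs.copy()
    for tid in above:                        # zero out the rest, sparing EOS/newline
        if tid != keep and tid not in special_ids:
            out[tid] = 0.0
    return out / out.sum()                   # renormalize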

@baronrabban

I am curious what type of slowdown you are seeing. I am running tensor parallel with multiple GPUs.

Cold evaluation with your XTC change:

(Queue: 0.0 s, Process: 0 cached tokens and 14038 new tokens at 618.25 T/s, Generate: 13.02 T/s, Context: 14038 tokens)

Two cold evaluations without your change:

(Queue: 0.0 s, Process: 0 cached tokens and 14038 new tokens at 623.05 T/s, Generate: 13.18 T/s, Context: 14038 tokens)

(Queue: 0.0 s, Process: 0 cached tokens and 14038 new tokens at 622.48 T/s, Generate: 13.22 T/s, Context: 14038 tokens)

So the slowdown was either 0.16 or 0.20 but I would not describe this as a heavy slowdown. Also, I think your XTC change is working and producing similar results to those I saw testing XTC in kobold.

@turboderp
Member

You're not going to see much of a slowdown when the baseline is 13.2 t/s. It's an extra 1.2 ms/token of latency, which would definitely be felt with smaller models. The right place to apply this would be right before the multinomial, just by scaling the sampling interval from 0..1 to x..1, where x is either a constant or adjusted based on what the top token is, if (as seems to be the trend with these methods) you have to make exceptions to avoid completely breaking the model.
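
(For illustration, here is one way that interval scaling could look. This is a rough NumPy sketch of the idea, not actual exllamav2 code: draw the random point from [x, 1) instead of [0, 1) and walk the sorted cumulative distribution, so any token whose cumulative mass lies entirely below x can never be picked.)

import numpy as np

def sample_shifted(probs, x=0.0, rng=np.random.default_rng()):
    order = np.argsort(probs)[::-1]          # sort descending
    cum = np.cumsum(probs[order])
    u = x + (1.0 - x) * rng.random()         # uniform in [x, 1) instead of [0, 1)
    idx = np.searchsorted(cum, u)            # inverse-CDF lookup
    return order[min(idx, len(order) - 1)]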

Why not use skew sampling, though? It's a very similar idea, only it's a smoother function.

XTC:

[image]

Skew:

[image]

@Cyrus-Hei
Author

Different models react differently as far as I can tell, probably related to vocab size and how the implementation checks logits. At XTC probability 0.8, I noticed a drop from 31 T/s to 24 T/s for a Gemma 2 27B finetune at 5 bpw, and a drop from 38 T/s to 31 T/s for a Mistral Nemo upscale finetune (Theia 21B) at 6 bpw. On the other hand, I noticed only a drop from 36 T/s to 35 T/s on Mistral Small Instruct 2409 at 6.5 bpw. These were all carried out on a single RTX 3090 at around 5k/16k context.

As a side note, I can't really tell if XTC is actually working, to be honest; I cannot tell if the generation difference is from temperature or XTC (or my settings are just off for the models I am testing). And I am very much skeptical about using a probability to determine whether the sampler should activate or not.

As for adding exceptions, I am also not sure if that is required, as the idea is to remove the set of tokens above the threshold except the least probable one in the set. I would suppose that if the reply should end, the EOS token would have a very high probability, likely making it the only token above the threshold (and thus keeping it). The exception was added only to prevent breaking larger models (70B+), as mentioned in the original PR, and I suspect the discussion is still ongoing on their side. My experience is only with smaller models (30B and below), and they have been working completely fine without handling EOS as an exception, in both this implementation and that of KoboldCPP.

For the skew sampler, I want to suggest adding more documentation for it; I am as lost as I could be trying to figure out what values to use. And if I had to say what makes XTC better (or worse), it would be XTC's easier-to-understand parameters and its more rigorous attempt at completely eliminating top tokens when it activates, which pushes the LLM harder toward more "creative" tokens.

@baronrabban

At xtc probability 0.8

How did you select this? I believe the default is 0.5, which means XTC kicks in 50% of the time. At 0.8 I believe you're using XTC 80% of the time, which perhaps leads to more of a slowdown than using it 50% of the time.

I am using a variant of Mistral Large, taking up 100 GB of VRAM, with 0.8 temperature, 0.02 min-P, and all other samplers disabled. The results are quite good and there are no quirks. I put prints on the EOS/newline section and it is definitely kicking in at the end of generation.

I can tell XTC is working, besides print statements, given the way it changes the story. I have some scenarios where I know how it's supposed to go and it goes that way pretty much every time but with XTC it's definitely changing things up in a good way.

@Cyrus-Hei
Author

At 0.8 I believe you're using XTC 80% of the time which perhaps leads to more of a slowdown than using it 50% of the time.

It is more of an experimental value for testing purposes, and you are absolutely right that 0.8 slows down generation more than 0.5. The point I want to make is that this implementation can drop generation speed significantly on some models and settings; for example, if a model suffers a 20% slowdown at 0.8 XTC probability, it is reasonable to expect around a 10% slowdown at 0.5.

I would say this implementation is more for testing only. If more data supports the effectiveness of XTC, I might look into doing it in C++ if I have time, or someone else could open a new PR.

@aarongerber

aarongerber commented Sep 19, 2024

Why not use skew sampling, though? It's a very similar idea, only it's a smoother function.
@turboderp Did you ask p-e-w about this on the original implementation, or get an answer? My first thought is that it might allow for too many of the top choices. Still, I would love to see you explore this! :) If I had half your brains / knowledge of coding or math I would. The visual looks compelling. I am assuming it wouldn't eliminate the top choices but decrease their probability? It would be interesting if you could set the width/spread/range of the curve, and its position. This would let you raise or lower the probability of top tokens in two ways. Perhaps unneeded control? Anyway, thanks for all you do!

@turboderp
Member

I am assuming it wouldn't eliminate the top choices but decrease their probability?

@aarongerber That is correct. It smoothly skews the distribution to the right and doesn't completely exclude top tokens. Here's the crazy thing though: that's the same thing you achieve by randomly switching XTC on and off. Reducing the probability of an outcome by 100% 50% of the time is the same as reducing it by 50% 100% of the time. The sampler ultimately picks one token, and switching between different distributions doesn't change the fact that in the end you have one final probability of picking any given token, and all of those together make up one distribution.

As for why people are trying all these crazy samplers in the first place, I think it really just comes down to entropy. Natural human language has some natural level of entropy, and so do pretrained models. But preserving it through finetuning is a difficult balancing act, and adding more randomness is a very crude way to try to un-collapse a model that's been damaged by overeager finetuning or a narrow focus on benchmark scores. Results are always going to be mixed.

@Downtown-Case
Contributor

In theory, XTC does more of what I want than skew: it chops off the top choice that tends to get the LLM into loops, while leaving the rest of the sampling untouched.

In practice, I'm having a hard time dialing it in; it either still loops or starts missing important "choice" tokens like names too much.

I have not played with skew a ton, though... there was not much documentation for it, and I wasn't aware of how it actually affects the distribution until now!

@p-e-w
Contributor

p-e-w commented Sep 20, 2024

@turboderp

Here's the crazy thing though: that's the same thing you achieve by randomly switching XTC on and off. Reducing the probability of an outcome by 100% 50% of the time, is the same as reducing it by 50% 100% of the time.

Perhaps I'm missing a deeper point you're trying to make here, but taken at face value, that isn't correct. With XTC, the "coin flip" happens for each token position. In other words, for half the tokens being generated, the distribution is chopped off, and for half of them it is left untouched. That is not the same as transforming every distribution to a lesser degree, and it doesn't average out to be the same in any obvious sense either, because different tokens are affected each time this happens.

Modifying every distribution often causes the output to slowly go off the rails, whereas XTC is metaphorically equivalent to occasionally pushing the model onto a different rail track.

On a side note, that's the first time I've ever heard of "skew sampling", and a web search didn't turn up any relevant references either. Where is this documented or implemented, and what exactly is the transformation being applied?

@mammour

mammour commented Sep 20, 2024

On a side note, that's the first time I've ever heard of "skew sampling", and a web search didn't turn up any relevant references either. Where is this documented or implemented, and what exactly is the transformation being applied?

Skew sampling is already implemented in exl2 samplers.

The code can be found here: ext_sampling.cpp

line 194:

if (num_probs || (skew != 0.0f))
{
    num_candidates = pre_sort_descending(num_candidates, temp_probs, temp_indices);
    sort_descending(num_candidates, temp_probs, temp_indices, num_probs);
}

line 228:

if (skew != 0.0f)
{
    random_s = powf(random, expf(-skew));
}

line 248:

float random_s_adj = random_s * 0.9998;

multinomial_cpu(num_candidates, temp_probs, temp_indices, random_s_adj);

An o1 explanation, and a possible optimization; maybe that last part is dumb, I can't tell, I'm still learning (❁ᴗ͈ˬᴗ͈)

@turboderp
Member

@p-e-w

Perhaps I'm missing a deeper point you're trying to make here, but taken at face value, that isn't correct. With XTC, the "coin flip" happens for each token position. In other words, for half the tokens being generated, the distribution is chopped off, and for half of them it is left untouched.

Well, no. Using a coin toss to select between two different distributions is just a roundabout way of averaging them. Or in other words, if you have a top token with a probability of 70%, and you reduce that probability to zero half the time, the final likelihood of sampling that token is 35%.

And yes, skew works by raising the sorted, cumulative probabilities to the power of exp(skew_factor). This is equivalent to raising the multinomial sampling point to a power of exp(-skew_factor) which is how it's implemented. The effect is to smoothly skew the distribution to the right:

[image]

It's of course not the same as XTC, but it is in the same spirit of shifting the mass away from the top token. And of course all the other samplers still apply as a means to affect the shape in weird and wonderful ways. None of it is especially scientific either way.
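
(A small sketch of that equivalence, assuming probabilities are already sorted descending as in ext_sampling.cpp; this is my own illustration, not library code.)

import numpy as np

def skew_pick(sorted_probs, skew, rng=np.random.default_rng()):
    cum = np.cumsum(sorted_probs)
    n = len(cum)
    u = rng.random()
    # View 1 (what the C++ code does): warp the sampling point.
    i1 = min(np.searchsorted(cum, u ** np.exp(-skew)), n - 1)
    # View 2: warp the cumulative probabilities and sample with plain u.
    i2 = min(np.searchsorted(cum ** np.exp(skew), u), n - 1)
    return i1, i2    # the two views agree up to floating-point rounding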

@aarongerber

Well, no. Using a coin toss to select between two different distributions is just a roundabout way of averaging them. Or in other words, if you have a top token with a probability of 70%, and you reduce that probability to zero half the time, the final likelihood of sampling that token is 35%.

But are you considering that this is a series? Not choosing a top token for the first token in a reply isn't the same as not choosing the top token in the second token position … is it? I'm not a math person but on the surface it feels very different. If I don't choose "Once"… "upon" won't be the top choice in the phrase "once upon a time"… but if I do choose "Once" because it had a chance to exist… there is a 35% chance of "upon" being selected. In XTC… as a whole throughout general use it may have similar results to skew, but depending on when it kicks in I would think it might produce a different outcome.

But again I don’t know math in this area.

To me, it feels like XTC has the potential to add entropy to a place that samplers rarely target. The top option is always more likely regardless of how much you pull other items off the other side. Even if I am right about the math being different for XTC, I hope Skew gets the chance for another variation to tweak token selection, and if you are right that it's the same but smoother, then I hope Skew gets a chance at a better implementation.

So much for not opening my mouth and removing all doubt I’m a fool :) thanks for reading.

@Ph0rk0z

Ph0rk0z commented Sep 20, 2024

In my case, for Large, I am using 0.9 temp, 0.03 min_P and 0.05/0.55 XTC. That is in the HF implementation, though.

@baronrabban

I was interested to understand how this sampler remains coherent even if it is chopping off all the top choices. So I ran some inferences and tracked exactly how much was getting chopped.

Using xtc_threshold 0.1 and xtc_probability 0.5

That means 50% of the time XTC is not running at all, so no top choices get chopped in those cases.

But what surprised me is that out of 519 XTC evaluations which actually passed the coin flip, the median number of tokens removed was 0. The mean was 0.500, and the max was 4.

So over the full evaluation, about 75% of the time absolutely nothing is happening, and in the cases where something is, the most likely outcome is that a single top choice is removed.

I am not saying any of this is good or bad; it just surprised me, as I was expecting dozens or hundreds of top choices to be removed each time, but that definitely isn't the case. So I think the secret to its coherence is how little is being removed.

Also worth noting that you can run at much lower xtc_threshold values and it still seems to produce reasonable outputs.

xtc_threshold = 0.01: median removed 1, mean 2.3, max 20

xtc_threshold = 0.001: median removed 3, mean 6.2, max 140

@turboderp
Member

turboderp commented Sep 21, 2024

But are you considering that this is a series? Not choosing a top token for the first token in a reply isn't the same as not choosing the top token in the second token position … is it? I'm not a math person but on the surface it feels very different.

It might be counterintuitive, idk, but switching the feature on and off, with some probability, accomplishes the same as sampling from a weighted average of the original distribution and the modified distribution.

As an example, suppose you have two 6-sided dice, one with faces [1, 2, 3, 4, 5, 6] and one where the faces are [4, 4, 5, 5, 6, 6], corresponding to distributions of [1/6, 1/6, 1/6, 1/6, 1/6, 1/6] and [0, 0, 0, 1/3, 1/3, 1/3].

Then you toss a coin to select which die to roll. If you're rolling the first die, the probability of rolling a 1 will be 1/6. But you only roll the first die 50% of the time, so the probability of selecting that die with the coin toss and then rolling a 1 is 1/12. Similarly for each outcome:

1: 50% * 1/6 + 50% * 0 = 1/12
2: 50% * 1/6 + 50% * 0 = 1/12
3: 50% * 1/6 + 50% * 0 = 1/12
4: 50% * 1/6 + 50% * 1/3 = 1/4
5: 50% * 1/6 + 50% * 1/3 = 1/4
6: 50% * 1/6 + 50% * 1/3 = 1/4

And therefore you'd get the exact same behavior from a single loaded die with the distribution [1/12, 1/12, 1/12, 1/4, 1/4, 1/4].

XTC does the same thing. It isn't eliminating the top token(s) but reducing their overall probability by a factor of xtc_probability (if nitpicking, this is of course prior to renormalization). You could mix in the original distribution with a weight of 1 - xtc_probability and achieve exactly the same result. Not that you necessarily should, since it is actually more efficient to switch it off randomly (may or may not be the case in some batched implementations), but in terms of the expected outcome and how it relates to alternatives like skew, mixing and switching are completely equivalent.
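
(A quick numerical check of the dice example, illustrative only:)

import numpy as np

rng = np.random.default_rng(0)
fair = np.full(6, 1 / 6)
chopped = np.array([0, 0, 0, 1 / 3, 1 / 3, 1 / 3])
mixed = 0.5 * fair + 0.5 * chopped           # [1/12]*3 + [1/4]*3

n = 200_000
rolls = np.array([rng.choice(6, p=fair if rng.random() < 0.5 else chopped)
                  for _ in range(n)])        # coin flip, then roll the chosen die
print(np.bincount(rolls, minlength=6) / n)   # ~[0.083, 0.083, 0.083, 0.25, 0.25, 0.25]
print(mixed)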

The top option is always more likely regardless how much you pull other items off the other side

Yep, this was also the motivation for skew. All other samplers do some kind of flattening/sharpening and/or tail truncation, so the top token stands out more or less depending on various parameters but it always remains on top. I wanted to switch that up just as an experiment to see if it could add a different flavor of entropy. Or something.

Don't know if I'd call it a success. It predictably hurts performance (as measured by benchmarks etc.) and makes the model less intelligent. And like every other sampler it still becomes a very unsatisfying balancing act where too little skew does barely anything and too much makes the model incoherent. I attribute that to the fact that, at the end of the day, the problem is with the model, not with the sampler. And it seems like added randomness at the sampling stage just can't correct for overfitting. I suspect there just isn't enough information in the logits to allow for any sort of course correction that doesn't end up being destructive, so maybe you'd want to inject "creativity" in earlier layers. Perhaps adding noise to the attention scores would do something, or a "fatigue" bias that discourages the model from repeatedly attending to the same patterns.

@p-e-w
Contributor

p-e-w commented Sep 21, 2024

@turboderp

The reason why simply reducing the top probabilities by half isn't equivalent to XTC's behavior is normalization. Here's an illustration using your example with dice:

XTC

  1. Original distribution: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6].
  2. Applying XTC: [0, 0, 0, 1/6, 1/6, 1/6]. This isn't a probability distribution, so we need to renormalize. The result is [0, 0, 0, 1/3, 1/3, 1/3].
  3. Choosing between the two distributions with 50% probability gives a resulting distribution that is a weighted average as you correctly state, namely [1/12, 1/12, 1/12, 1/4, 1/4, 1/4].

Reducing top probabilities by 50%

  1. Original distribution: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6].
  2. Reducing top probabilities by 50%: [1/12, 1/12, 1/12, 1/6, 1/6, 1/6]. This isn't a probability distribution, so we need to renormalize. The result is [1/9, 1/9, 1/9, 2/9, 2/9, 2/9].

[1/12, 1/12, 1/12, 1/4, 1/4, 1/4] is not the same distribution as [1/9, 1/9, 1/9, 2/9, 2/9, 2/9].

@turboderp
Member

@p-e-w Yes, it's imprecise not to account for where in the process normalization takes place, which is why I was talking about mixing two normalized distributions, in which case it doesn't matter. The point was to correct the misconception that switching between distributions is somehow materially different from mixing them. In either case you end up with a resulting curve that is some weighted sum of the two original curves.

If you want something precisely equivalent to XTC with a switching probability, you can work out the exact factor by which to scale the top probs to achieve that prior to normalization (based on the switching prob and the sum of the top prob(s)), or you can normalize the two distributions before mixing. Either way, the point is (!) that you end up with a single distribution where the previously-top choices have been scaled by some factor (not reduced to zero) and the lost probability mass is distributed over the rest of the options to maintain the sum.
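
(To make that concrete with the dice numbers above, a small sketch of my own: mixing the two already-normalized distributions, weighted by xtc_probability, reproduces the coin-flip result exactly, whereas halving the top probs before normalizing gives the other distribution from the previous comment.)

import numpy as np

orig = np.full(6, 1 / 6)
xtc = np.array([0, 0, 0, 1 / 3, 1 / 3, 1 / 3])   # XTC result, already renormalized
p = 0.5                                          # xtc_probability

mix_after_norm = (1 - p) * orig + p * xtc        # [1/12]*3 + [1/4]*3, same as the coin flip
scaled = orig.copy()
scaled[:3] *= 0.5                                # halve the top probs first...
mix_before_norm = scaled / scaled.sum()          # ...then normalize: [1/9]*3 + [2/9]*3
print(mix_after_norm)
print(mix_before_norm)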

@turboderp
Member

Okay, so I had some time to actually look at this. And the way it's written in this PR, you wouldn't see any effect because the logits are modified after the extension function is called to pick a token.

I implemented another version in C++ with the latest commit, and even though it became a bit more involved than scaling the sampling point due to the excluded token list, the overhead is still hard to measure.

Note that because it apparently isn't meant to affect the top choice unless there's at least one other option over the threshold, there are many cases where it still doesn't have an effect. E.g. for most models, "why don't scientists trust" will be followed by "atoms" with about a 99% likelihood, and since the filter is applied after top-P, min-P etc., there are limits to what it will do in cases where the model is already very certain, unless you also disable truncation samplers.

That does also make it pretty stable, though, so you have to crank it up real high before gens become incoherent. At least with the models I was testing with (mostly Mistral).

@baronrabban

I wanted to try it but there is something wrong in dev. I am using tensor parallel and Mistral Large. My reproducer is: https://gist.github.com/baronrabban/e03c203d7189444c1cab8bf0a8d03ed4

I stepped back through your commits to find where the trouble showed up. The last good commit is:
c4a03e0

The next commit just stack dumps for me with a floating point exception so I can't test it: e155e0a

Then the trouble shows up in the next commit, here: 9946f45

c4a03e0 produces very sane results with my reproducer:

Bob was a man looking for a job. As Bob rounded the corner he kept an eye out for the place he was sent to apply. He had spoken to the construction foreman on the phone, and the man had told Bob to meet him at the construction site. The foreman had told him that there would be a job for him if he got there on time. Bob checked his watch. He was early. He was always early.

9946f45 produces very insane results:

Bobby, Bob's childhood nick-named, grew older. Bob's childhood nick-named, Bobby, grew older. Bob wished Bobby's childhood nick-named, Bobby, grew older. Bob wished Bobby's childhood nick-named, Bobby, grew older. Bob wished Bobby's childhood nick-named, Bobby, grew older. Bob wished Bobby's childhood nick-named, Bobby, grew older. Bob wished Bobby's childhood nick-named, Bobby, grew older. Bob wished Bobby's childhood nick-named, Bobby, grew older. Bob wished Bobby's childhood nick-named, Bobby, grew older. Bob wished Bobby's childhood nick-named, Bobby, grew older.

That trouble is never reverted and continues to the latest commit in dev.

@p-e-w
Contributor

p-e-w commented Sep 23, 2024

@turboderp

Note that because it apparently isn't meant to affect the top choice unless there's at least one other option over the threshold, there are many cases where it still doesn't have an effect. E.g. for most models, "why don't scientists trust" will be followed by "atoms" with about a 99% likelihood, and since the filter is applied after top-P, min-P etc., there are limits to what it will do in cases where the model is already very certain, unless you also disable truncation samplers.

Correct, and in fact a relatively large parameter range is viable. I've experimented with xtc_threshold between 0.05 and 0.25, and xtc_probability between 0.2 and 1.0. Many models remain coherent for almost any pair of parameter values from those ranges, and effects on model behavior range from "barely noticeable" to "altered beyond recognition".

As xtc_threshold approaches 0.5, the effect vanishes, and as xtc_probability approaches 0, the effect also vanishes, so there is a lot of room for adjustment if the output shows undesired artifacts.

@Cyrus-Hei
Author

Okay, so I had some time to actually look at this. And the way it's written in this PR, you wouldn't see any effect because the logits are modified after the extension function is called to pick a token.

I guess that is one reason why I don't feel any difference (I thought the extension function was only there to apply multiple samplers). I suppose a Python implementation just isn't very feasible for XTC here, since it is supposed to go after the truncation samplers.

Anyway, thanks a lot for taking the time to implement this in C++; your work has been tremendous for me and a lot of users. I will close this PR now that it has been implemented in the dev branch.

@Cyrus-Hei Cyrus-Hei closed this Sep 23, 2024
@turboderp
Member

@baronrabban It had become necessary to abandon the safetensors library because it was causing way too many issues. I'm gradually fixing all the issues with the replacement loader, and currently (with the latest commit as of right now) it should be stable and usable for TP.

It may be that there are still some quirks left on Windows; I'm currently only testing on Linux. But give it a shot.

@baronrabban

@turboderp

I confirm the dev branch is working fine now after your latest commit so I was able to spend time this evening with the XTC implementation, using tabbyAPI and SillyTavern.

To confirm I was actually using it, I set the probability to 1 and the threshold to 0.000001, and the generation went bonkers, which was good.

I reset things back to the normal 0.5 probability and 0.1 threshold, and with Mistral Large it feels too subtle to me. Even at 0.07 it's just hard to see how much it's doing.

I spent the most time with a 0.05 threshold, and there you can really see a difference. Where it stands out is swiping/regenerating. Normally, swiping produces pretty consistent results, which is a letdown. But with XTC at 0.5/0.05 the swipes produce very different results, yet everything is still coherent.

I do not see any hit to performance. I think it's a good feature and people will enjoy it.

@turboderp
Member

I think the real thing you'd want to compare it to, aside from maybe skew, would be something like:

top P = 0.8
temperature_last = True
temperature = 10
(nothing else)

I.e. select more or less at random between whatever tokens are within the top 80% of the distribution (or pick the top token if it is more than 80% likely). This should also keep generations coherent while providing a lot of "creativity".
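
(Roughly, a generic sketch of those settings, not the exllamav2 implementation: truncate to the top-P nucleus first, then flatten whatever survives with a very high temperature applied last.)

import numpy as np

def nucleus_then_hot(probs, top_p=0.8, temperature=10.0, rng=np.random.default_rng()):
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    k = np.searchsorted(cum, top_p) + 1      # smallest prefix covering top_p of the mass
    kept = probs[order[:k]]
    kept = kept ** (1.0 / temperature)       # temperature applied after truncation
    kept /= kept.sum()                       # with T=10 this is nearly uniform
    return rng.choice(order[:k], p=kept)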
