[Bug Report] Mixtral generates nonsense #570
Comments
Generate is not a good way to check a model is running properly. Can you run the following and share the results?
a_large_chunk_of_text = "Generate is not a good way to check a model is running properly. Can you run the following and share the results?"
loss = model(a_large_chunk_of_text, return_type="loss")
print(loss.item()) |
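For anyone trying to reproduce this check end to end, a minimal sketch is below. The model repo id and dtype handling are my assumptions rather than anything stated in the thread, and the keyword names may differ between TransformerLens versions.

```python
import torch
from transformer_lens import HookedTransformer

# Assumes enough GPU memory for Mixtral and access to the weights.
# The dtype keyword name may differ between TransformerLens versions.
model = HookedTransformer.from_pretrained_no_processing(
    "mistralai/Mixtral-8x7B-v0.1",
    dtype=torch.bfloat16,
)

a_large_chunk_of_text = (
    "Generate is not a good way to check a model is running properly. "
    "Can you run the following and share the results?"
)
loss = model(a_large_chunk_of_text, return_type="loss")
# The 12.5 and 5.5-8.7 losses reported later in this thread were treated as signs something is off.
print(loss.item())
```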
This could be related to this issue: #591. If I were you, I'd |
Are you adding or missing BOS tokens? When we accepted a PR for this, I
assume it was working...
…On Tue, May 21, 2024, 8:58 PM Joel Burget ***@***.***> wrote:
Still seems quite bad even when verifying that I have opt_einsum
installed. For good measure I checked a few more times and got 5.5, 6.7,
and 8.7. So not quite as bad as 12.5, but still problematic.
[Screenshot: Screenshot.2024-05-21.at.12.55.11.PM.png, https://github.com/TransformerLensOrg/TransformerLens/assets/310981/c22250e0-eb7e-491c-97cc-5230db773bf4]
|
I'm not using any special tokens at all. The a_large_chunk_of_text code block is what you pasted verbatim. |
Ahh sorry. Sometimes language models expect a special prepended token. You
can often find examples of this in model cards on huggingface. I'm not sure
but this could be the issue here.
…On Tue, May 21, 2024, 9:42 PM Joel Burget ***@***.***> wrote:
I'm not using any special tokens at all. The a_large_chunk_of_text code
block is what you pasted verbatim.
|
Unfortunately I don't think that's the issue. Here are the instructions from the Mixtral HF repo: Running those in a Colab we can see that the only special token they prepend is BOS (1): TransformerLens of course does this by default but I double-checked: adding |
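To make that check concrete, here is a small sketch of the kind of BOS/tokenization comparison being discussed in the last few comments. `model` is assumed to be the already-loaded HookedTransformer, and the repo id is my assumption:

```python
from transformers import AutoTokenizer

text = "Hello my name is"

hf_tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
hf_ids = hf_tok(text)["input_ids"]          # HF prepends BOS (token id 1) by default

tl_ids = model.to_tokens(text)[0].tolist()  # TransformerLens also prepends BOS by default

print("HF tokens:", hf_ids)
print("TL tokens:", tl_ids)
assert hf_ids == tl_ids, "tokenization mismatch: check BOS / special-token handling"
```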
I wonder if this is a memory issue. The only difference between the config for TransformerLens and HuggingFace is n_ctx being capped at 2048 for memory concerns, where it is 32768 in HuggingFace. |
I don't see how memory could explain that? Low memory shouldn't silently
lead to issues, it should lead to out of memory errors.
Mathematically, TL and HF *should* be doing the same operations, and this
discrepancy implies they are not, and it seems important to understand why.
The obvious hypothesis is that it's the mixture-of-experts routing, which,
being discrete, may amplify small errors.
So it would be good to compare the routing on HuggingFace and the
routing in TL and see how often they disagree in these generations.
…On Tue, 4 Jun 2024 at 02:08, Bryce Meyer ***@***.***> wrote:
I wonder if this is a memory issue. The only difference between the config
for TransformerLens and HuggingFace is n_ctx being capped at 2048 for
memory concerns, where it is 32768 in HuggingFace. The generation isn't
entirely nonsense, it's just mostly generating French, and every once in a
while Spanish or English. What it is actually generating does for the most
part make sense as something that would follow the idiom being passed
through in the various languages. I could imagine that a lack of memory
could cause something like languages to get mixed up, and I think it may be
a good place to start investigating.
|
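One possible way to start the routing comparison suggested above is sketched below. This is exploratory: if I'm reading the transformers API correctly, `output_router_logits=True` returns per-layer router logits for Mixtral, but I have not verified which TransformerLens hook exposes the router scores, so the cache inspection is just a starting point. `hf_model`, `tok`, and `tl_model` are assumed to be already loaded.

```python
import torch

prompt = "Hello my name is"
input_ids = tok(prompt, return_tensors="pt").input_ids.to(hf_model.device)

# transformers' Mixtral can return per-layer router logits directly.
with torch.no_grad():
    hf_out = hf_model(input_ids, output_router_logits=True)
hf_top2 = [rl.float().topk(2, dim=-1).indices for rl in hf_out.router_logits]

# On the TransformerLens side, cache everything and look for a router/gate hook to compare against.
_, cache = tl_model.run_with_cache(prompt)
candidate_hooks = [name for name in cache.keys() if "gate" in name or "expert" in name]
print(candidate_hooks)   # once the right hook is identified, take its top-2 indices per token
print(hf_top2[0][:5])    # layer-0 expert choices for the first few positions, for reference
```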
Well I think I have a pretty decent idea on how to start with this. I will work with @joelburget to try and isolate the error, and hopefully we can find it relatively quickly. |
I ran some experiments in this notebook: https://gist.github.com/joelburget/bae5ea4d997e804b2a65d02d5b61f5bc
I'm surprised that there are more matching block outputs than mlp or attention, but whatever, I guess this is possible. Ideas for where to go from here? |
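For readers following along, the kind of layer-by-layer comparison used in that notebook can be sketched as below. This is my reconstruction, not the notebook's code, and it assumes a `from_pretrained_no_processing` load, since folded weight processing changes the residual stream and makes it no longer directly comparable.

```python
import torch

# Assumes `tl_model` (HookedTransformer, loaded without processing), `hf_model`, and `tok` are set up.
prompt = "Hello my name is"
input_ids = tok(prompt, return_tensors="pt").input_ids.to(hf_model.device)

with torch.no_grad():
    hf_hidden = hf_model(input_ids, output_hidden_states=True).hidden_states  # n_layers + 1 tensors
_, cache = tl_model.run_with_cache(prompt)

for layer in range(tl_model.cfg.n_layers):
    tl_resid = cache["resid_post", layer][0]                        # [seq, d_model]
    hf_resid = hf_hidden[layer + 1][0].to(tl_resid.device, tl_resid.dtype)
    max_diff = (tl_resid - hf_resid).abs().max().item()
    print(f"layer {layer:2d}  max |TL - HF| = {max_diff:.4f}")
```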
I am going to play around with this further tomorrow to see if I can figure out anything. I wonder if the problem lies in the MLP component itself. Maybe there is a discrepancy there that only reveals itself in larger operations? We played around with the MoE component a bit last week, and that seems to be working as expected. We started working on looking at the weights, so having that ruled out is definitely a good step. I should have quite a bit of time to look at this, provided that nothing else comes up. I am going to start with playing around with the MLP outputs to see if I can figure anything out. |
@joelburget I have been messing around with this for the last 3-4 hours. Unfortunately, I was not able to load the model. I am working on the branch |
@haeggee suggested this: |
Quite a bit is out of sync with the transformers implementation: https://github.com/huggingface/transformers/blob/b7672826cad31e30319487af876e608d8af7d37b/src/transformers/models/mixtral/modeling_mixtral.py#L843C1-L843C69. We looked at this quite a bit in our little coding session, but we were not able to pinpoint the actual cause. I am going to break down the forward pass on this, and add a full suite of unit tests with input/output grabbed from transformers, to be able to test it in a more isolated manner. |
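A cheap way to grab such input/output pairs from transformers is a forward hook on a tiny, randomly initialized Mixtral. This is only a sketch of the idea; the attribute path (`model.layers[0].block_sparse_moe`) follows my reading of modeling_mixtral.py and should be double-checked against the pinned commit.

```python
import torch
from transformers import MixtralConfig, MixtralForCausalLM

# Tiny random config so the fixture is cheap to generate.
config = MixtralConfig(
    hidden_size=64, intermediate_size=128, num_hidden_layers=2,
    num_attention_heads=4, num_key_value_heads=2,
    num_local_experts=4, num_experts_per_tok=2, vocab_size=1000,
)
hf_model = MixtralForCausalLM(config).eval()

captured = {}

def grab(module, inputs, output):
    captured["input"] = inputs[0].detach().clone()
    captured["output"] = output[0].detach().clone()  # the MoE block returns (hidden_states, router_logits)

hf_model.model.layers[0].block_sparse_moe.register_forward_hook(grab)
with torch.no_grad():
    hf_model(torch.randint(0, config.vocab_size, (1, 16)))

# A unit test can then feed captured["input"] through the TransformerLens MoE block
# (with the same weights copied over) and assert closeness to captured["output"].
torch.save(captured, "moe_block_io.pt")
```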
I think this has something to do with the W_Gate variable. That is the biggest difference between our implementation and the Hugging Face implementation. I may be completely off on this, but it is maybe the next thing to look at. I made a few changes to make it more in line with Hugging Face on my previously mentioned branch, and if anyone wants to mess around with that gate before I have a chance to look at it again, it may be a good place to start. There are still other differences between the two implementations, but that variable appears to be the most substantial at a glance. |
OK, so not 100% there, but it seems like it is closer. I got this result by changing the dtype on the W_Gate to torch.float. The second inference was mixed between English and French again, and the third was completely French. I think the issue lies somehow in the einops operation right at the top. Don't have more time to look at it today, but I think this is real progress. |
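That dtype observation fits the discrete-routing hypothesis raised earlier in the thread: tiny numerical differences in the gate logits can flip which experts get selected. Below is a self-contained toy demonstration with random weights, nothing Mixtral-specific. (If I'm reading modeling_mixtral.py correctly, the transformers implementation computes the routing softmax in float32 for exactly this reason.)

```python
import torch

torch.manual_seed(0)
hidden = torch.randn(4096, 4096)          # pretend token activations
w_gate = torch.randn(4096, 8) * 0.02      # pretend router weights, 8 experts

logits_fp32 = hidden @ w_gate
logits_fp16 = (hidden.half() @ w_gate.half()).float()

top2_fp32 = logits_fp32.topk(2, dim=-1).indices
top2_fp16 = logits_fp16.topk(2, dim=-1).indices

# Fraction of tokens whose top-2 expert choice differs purely due to precision.
flipped = (top2_fp32 != top2_fp16).any(dim=-1).float().mean().item()
print(f"fraction of tokens whose top-2 expert choice changes: {flipped:.4f}")
```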
I tried @bryce13950's change, which unfortunately didn't seem to help. I tried both his branch (see mixtral-playing.ipynb) and a modification with the line |
Could it be a sliding window attention issue? |
@Butanium I will definitely play with that this afternoon |
@bryce13950's latest change seems promising. Still seems not exactly right, but closer? |
@joelburget Yeah I think the issue is a composite of a few different problems. The big breakthrough was last night with setting the |
OK, so another note. I tried changing |
When I am referring to mutable variables I am basically talking about any variable changing in a pass, so the example you gave is exactly what I am referring to, among potentially other things. If we can figure out what exact variable mutation is causing the error, we should then be able to figure out where that error is coming from. We should be able to do that by looking at individual variables, preventing them from changing in any way between passes, and then seeing whether the model continues to generate English across multiple passes. If it does, we should be able to work backwards and figure out where the variable is changing. As for the other errors, it honestly doesn't really surprise me. I haven't run smaller Mixtral models since the PR to add them was opened, but the smaller ones were working at that point. I think there is a distinct possibility that these sorts of errors are negligible in smaller models, but then manifest in what we are currently seeing in these large models. Hopefully this whole effort will shed some light on how to make sure things are better kept in check with Hugging Face, so that we can reliably support even larger models in the future. |
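One concrete piece of per-pass state to rule in or out is the KV cache used inside generation. A hedged sketch: I believe `generate` exposes a `use_past_kv_cache` flag, but the exact keyword may vary by TransformerLens version.

```python
# Assumes `model` is the loaded HookedTransformer for Mixtral.
prompt = "Hello my name is"

with_cache = model.generate(prompt, max_new_tokens=50, use_past_kv_cache=True)
without_cache = model.generate(prompt, max_new_tokens=50, use_past_kv_cache=False)

print("with KV cache:   ", with_cache)
print("without KV cache:", without_cache)
# If only the cached run degrades into French, the culprit is state reused between forward passes.
```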
Are Mixtral logits similar on a single prompt? Is this just a |
@Butanium The histograms shown here are from a single pass through generate; I would think it would be the same as a single call with a prompt. Honestly, it takes so long to run the full script that I haven't run it with a single prompt to verify. I can definitely play around to see if there is anything to that once I have a pause in testing various states. We have a new baseline. This was my last run. And the first generation... I also added a loop to the bottom of the script to run 10 more passes with the two different prompts after the histograms are generated. This is the result from the 2nd to 5th passes. Big step forward. They start degrading a very small amount in the second one, and then greatly in the third one. Only 1 of the 5 is generating French now, and only partially. Getting there. |
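To answer the single-prompt question directly, here is a hedged sketch of comparing logits for one forward pass; it assumes `tl_model` and `hf_model`/`tok` load the same checkpoint and are already set up.

```python
import torch

prompt = "Hello my name is"
input_ids = tok(prompt, return_tensors="pt").input_ids.to(hf_model.device)

with torch.no_grad():
    hf_logits = hf_model(input_ids).logits[0, -1].float()
    tl_logits = tl_model(prompt, return_type="logits")[0, -1].float().to(hf_logits.device)

print("max abs diff:      ", (hf_logits - tl_logits).abs().max().item())
print("top-1 token agrees:", hf_logits.argmax().item() == tl_logits.argmax().item())
```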
I completely replaced the MoE and MLP implementation with something that matches transformers exactly, outside of reshaping tensors to make them compatible with the rest of TransformerLens. That was a huge step forward, and it was working perfectly for the first 3 inferences. The fourth one (a "Once upon a time" generation) spat out German, which seems better, since German is in the same language family as English. After that it went back to English for the next "Hello my name is" generation. With that, I think it is now probably pretty safe to rule out the actual MoE logic as the problem, since it is identical to the one we are trying to simulate. I am going to try to connect it back to the existing GatedMLP implementation, and see if it retains the same performance. I then need to go through the MoE, connect hooks, and make a couple more changes. |
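For readers following along, the routing scheme being matched is roughly the following: a plain-PyTorch paraphrase of a transformers-style top-k MoE forward, not the actual code from either library.

```python
import torch
import torch.nn.functional as F

def moe_forward(hidden, w_gate, experts, top_k=2):
    """hidden: [tokens, d_model]; w_gate: [d_model, n_experts];
    experts: list of callables mapping [*, d_model] -> [*, d_model]."""
    router_logits = hidden @ w_gate                              # [tokens, n_experts]
    weights = F.softmax(router_logits, dim=-1, dtype=torch.float)
    topk_w, topk_idx = weights.topk(top_k, dim=-1)
    topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)           # renormalize over the chosen experts
    out = torch.zeros_like(hidden)
    for e, expert in enumerate(experts):
        token_idx, slot = (topk_idx == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        out[token_idx] += topk_w[token_idx, slot, None].to(hidden.dtype) * expert(hidden[token_idx])
    return out
```

Each token is routed to its top-k experts and the expert outputs are summed with renormalized routing weights, so any small change in the router logits can change which experts run at all.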
Interesting @bryce13950, I started doing something similar, see #641. I'll take a look through your changes a bit later! |
@joelburget Very good! Your implementation is much improved. I am going to be pretty much rewriting the MLP structure during a shared coding session on Thursday. If you can address the couple of changes I requested on the PR before then, I will get your PR wrapped up beforehand, and start the session with it included in my branch. If not, then I will start the session by wrapping up your PR. |
Alright, so this is probably going to be the last time I really look into this for a while. I discovered that if you generate only "Once upon a time", the generation still degrades rather rapidly. I am not sure that focusing on generation for that specific phrase is a great use of time. That phrase is an idiom, and not just an idiom, but one that doesn't really make a whole lot of sense. We are all very used to it, but if you break it down word by word, it doesn't hold together from a logical standpoint. When you translate it into different languages, the words themselves do not directly translate, and the phrase is often a series of words that make as little or even less sense than they do in English. All that to say: it is a specifically weird and difficult piece of language. Here is my last generation of something like 20-30 passes...
|
In this week's shared coding session, we spent some time adding Baichuan to TransformerLens. I was curious to add this model, since it generates both English and Chinese. On top of that, a lot of modern Chinese usage mixes some English into the language. Given this relatively uncommon quirk of the language, I was curious how it would work within TransformerLens, and it did generate a lot of nonsense, with the languages mixed. I then merged the last bit of work @joelburget did on the attention side of things, and that did seem to improve it marginally. I am writing tests for a big rework of all MLP-like components in the library at the moment, and I am very curious to see what changes both here and with the Baichuan models that were added the other day. I should have that done in the first half of the week, and we should have a release ready at that point with this greatly improved, and a lot more understood about TransformerLens accuracy! |
Merged into dev. Release coming up shortly. |
Describe the bug
I followed the instructions in docs/source/content/special_cases.md as well as I could tell (ran the model in both full precision and with HookedTransformer.from_pretrained_no_processing), yet my model generations were nonsensical.
Code example
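The original code example did not survive the page extraction. A minimal reproduction along the lines described above (a no-processing load followed by generation) would look roughly like this; the prompt and sampling settings are placeholders, not the reporter's actual code:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained_no_processing("mistralai/Mixtral-8x7B-v0.1")
print(model.generate("Hello my name is", max_new_tokens=50))
```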
System Info
Describe the characteristic of your environment:
- How transformer_lens was installed: pip install sae-lens
- Docker image: runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04
Additional context
Running on a 4x A100 SXM system on Runpod.
Checklist