-
Notifications
You must be signed in to change notification settings - Fork 220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mixtral scaling: Reduce perplexity from 4.294 to 4.269 #301
base: main
Are you sure you want to change the base?
Conversation
CC @younesbelkada. Not sure if this would break anything in the transformers integration. WDYT? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @casper-hansen !
Unless I am missing something I think this should be all good for transformers integration for now, meaning this PR is totally BC with transforers integration since it does not seem to touch any core module that we use (GEMM_Linear). If we want to upstream this into transformers we need to think about an API for users to replace the SparseMoE layers with ScaledMixtralSparseMoeBlock
As a sanity check I would run basic inference with transformers after emrging this PR just to be sure, but looking at the PR it does not seem to do anything that would break transformers integration |
Right, but we do introduce a scale in the MoE layer that needs to be loaded in order to run inference. Could it be possible to add this during the loading of the linear layers? FYI, for MPT models, we need the same thing but just for a different layer. |
the ppl improvement is really small did you try other scores to see if this is worth it ? |
This is how it should have been implemented from the start as this is a more correct implementation of AWQ. Yes, I have tried other combinations but it does not lead to better perplexity (at least the one’s I tried) |
This PR reduces the perplexity of Mixtral by introducing scaling of individual experts instead of one scale for all the experts in the MoE block.
To use this in vLLM, GGUF, or other integrations, their code needs to be updated to load the new ScaledMixtralSparseMoeBlock. The only modification needed in the inference code is right before the experts get the
current_hidden_states
:Idea by @Sakits, engineering & implementation by @casper-hansen