feat: neuron rotation #50
base: dev
Conversation
A good idea would be to change the rotation rate of the orthogonal matrix. I have not tested this, but I believe we will run into the problem that it will be even slower: according to GPT, we have to compute a matrix log and then a matrix power. I'll run some tests to see how well that works.
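For reference, a minimal sketch of that log-then-power idea (assuming SciPy is available; the helper name is illustrative, not from this PR):

```python
import numpy as np
from scipy.linalg import expm, logm

def fractional_rotation(R: np.ndarray, alpha: float) -> np.ndarray:
    # R**alpha = exp(alpha * log(R)); for an orthogonal R the matrix log is
    # (up to numerical noise) skew-symmetric, so the result stays orthogonal
    return expm(alpha * logm(R)).real
```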
GPT-4 found something called the Cayley transform, which seems to do what we want (linked examples: 2D with an arbitrary matrix, 3D with an arbitrary matrix). Link to the discussion with GPT-4: https://chat.openai.com/share/96a9b2ae-3a5f-47ce-8b22-bb07e5f6d1a9
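For context, a sketch of the textbook Cayley transform in PyTorch (hedged: this is the standard form, not necessarily what the PR ended up using):

```python
import torch

def cayley(A: torch.Tensor) -> torch.Tensor:
    # maps a skew-symmetric A to an orthogonal Q = (I - A) @ inv(I + A);
    # since (I - A) and (I + A) commute, solve() computes the same product
    I = torch.eye(A.shape[-1], dtype=A.dtype, device=A.device)
    return torch.linalg.solve(I + A, I - A)
```

Scaling A by alpha before the transform yields a family of partial rotations, which is presumably what makes it attractive here.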
It turns out that it is not that much slower. I'm hitting ~6 minutes when merging with it.
I ran more tests, and Cayley seems to break down in the 1D case. I need to spend more time on alpha to make it work.
I found that you can apply a fractional power to the eigenvalues of a matrix to implement a fractional matrix power. This is fairly slow and requires double precision to work; otherwise the output gets an imaginary component because of precision errors. With alpha not an integer, it takes ~15 minutes to merge.
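A sketch of that eigenvalue approach (double precision, as noted above; the function name is illustrative):

```python
import torch

def fractional_matrix_power(R: torch.Tensor, alpha: float) -> torch.Tensor:
    # complex eigendecomposition; float64 keeps the spurious imaginary
    # component of the result near zero
    vals, vecs = torch.linalg.eig(R.double())
    powered = vecs @ torch.diag(vals**alpha) @ torch.linalg.inv(vecs)
    return powered.real.to(R.dtype)
```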
Notes on a couple of trade-offs I had to look into:
The models I tested against were not completely different; in particular, the text encoder was the same. This skewed my small benchmarks for expected merge times: it seems to take 9 minutes to merge 2 v1 models with all keys different. Converting this to a draft until we get the merge time lower or determine that the merge method is valuable enough to outweigh 9 minutes.
sd_meh/merge_methods.py
Outdated
if len(a.shape) == 1 or len(a.shape) == 2 and a.shape[0] == 1:
    return new_centroid.reshape_as(a)

svd_driver = "gesvd" if a.is_cuda else None
This is actually a lot more complex than meets the eye. We should be determining the SVD driver based on the size of the matrix: different drivers are faster on smaller/bigger matrices, and in some instances the CPU will outperform the GPU. What exactly is our average matrix size when we call svd?
If we include all keys, it goes from
I've never done this before at all; this is all new to me, and I appreciate the help. IIUC, this only matters on CUDA devices?
All matrix sizes that currently go through SVD are listed below (a size-based driver sketch follows the list):
- 320x320: 47 keys
- 640x640: 48 keys
- 768x768: 94 keys
- 960x960: 2 keys
- 1280x1280: 83 keys
- 2560x2560: 10 keys
- 3072x3072: 12 keys
- 5120x5120: 6 keys
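To make the driver question concrete, a hypothetical size-based pick (the cutoff is invented and would need benchmarking against the sizes above; note that torch.linalg.svd only accepts driver= for CUDA inputs):

```python
import torch

def svd_auto_driver(a: torch.Tensor):
    driver = None  # CPU path: LAPACK, no driver choice
    if a.is_cuda:
        # hypothetical cutoff: Jacobi (gesvdj) for small/medium matrices,
        # gesvd for the largest ones; tune against real benchmarks
        driver = "gesvdj" if a.shape[-1] <= 1280 else "gesvd"
    return torch.linalg.svd(a, full_matrices=False, driver=driver)
```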
I did some benchmarking between JAX's SVD functions jitted through XLA and PyTorch's different drivers on a Colab using a V100 (a 3080 is about equal to this in PyTorch performance), and these were the results.
Basically, unless you need full accuracy, gesvdj is going to be faster, even with full_matrices set to True. However, the speed you gain comes at the cost of some accuracy, and gesvdj may not always converge, requiring a fallback to gesvd.
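A hedged sketch of that fallback pattern (assuming the non-convergence surfaces as torch.linalg.LinAlgError):

```python
import torch

def svd_fast_with_fallback(a: torch.Tensor):
    if not a.is_cuda:
        return torch.linalg.svd(a, full_matrices=False)
    try:
        # faster Jacobi driver; slightly less accurate, may fail to converge
        return torch.linalg.svd(a, full_matrices=False, driver="gesvdj")
    except torch.linalg.LinAlgError:
        # slower but robust fallback
        return torch.linalg.svd(a, full_matrices=False, driver="gesvd")
```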
By the way, full_matrices=False doesn't produce a reduced SVD when the matrix is square (the reduced and full factorizations coincide in that case).
So I did complete a full merge on CUDA and didn't receive the error. I think it has something to do with moving models between the CPU and GPU while the WebUI keeps models loaded in memory. Is there sanity checking when the models are loaded to ensure they have been moved to the CPU if the work_device is set to CPU?
Before merging, when assembling the merge args, the weights are sent to the requested device (lines 465 to 466 in 2780321):

"a": thetas["model_a"][key].to(work_device),
"b": thetas["model_b"][key].to(work_device),

Note that if work_device is None, it takes the value of device (lines 371 to 372 in 2780321):

if work_device is None:
    work_device = device
So IIUC, it shouldn't be a device issue.
I think I found the culprit. It seems that on CPU there sometimes isn't enough precision, which leads to a determinant of (nearly) zero. When the determinant of u times the determinant of v_t is zero, this line divides by zero:

u[:, -1] /= torch.det(u) * torch.det(v_t)

So the last column of u sometimes gets filled with infinities, and when the eigenvalues of the matrix are computed afterwards, an error is raised.
As noted below, while this prevents the entire merge from raising an error, rotations with invalid determinants still result in a broken merge. I went the other direction and raised an error instead.
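A rough sketch of such a guard (threshold, message, and function name are illustrative, not the actual patch):

```python
import torch

def proper_rotation(u: torch.Tensor, v_t: torch.Tensor) -> torch.Tensor:
    # det(u) * det(v_t) should be +/-1 for SVD factors; a value near zero
    # signals the CPU precision issue described above
    det = torch.det(u) * torch.det(v_t)
    if torch.abs(det) < 0.5:
        raise ValueError(
            "degenerate rotation determinant; try float64 or merging on CUDA"
        )
    u[:, -1] /= det  # flips the sign of the last column when det is -1
    return u @ v_t
```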
Wanted to try this with the Bayesian merger extension. I added the changed parts to merge_methods.py, and at the end of stage 1 I got this error (the fix is to use cuda as the device instead of cpu).
See the discussion here: #50 (comment). This can happen when merging on the CPU with fractional alpha.
For each key:
Note: this is pretty slow. On my RTX 3080 it takes me ~3 minutes to merge two models.
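For orientation, the per-key core appears to be the classic orthogonal Procrustes alignment; a rough reconstruction from the snippets quoted in this thread (centering and alpha handling omitted; details are guesses, not the actual implementation):

```python
import torch

def procrustes_rotation(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # orthogonal R minimizing ||a @ R - b||, via the SVD of a^T b
    svd_driver = "gesvd" if a.is_cuda else None
    u, _, v_t = torch.linalg.svd(a.T @ b, driver=svd_driver)
    # force a proper rotation (det = +1), cf. the determinant fix above
    u[:, -1] /= torch.det(u) * torch.det(v_t)
    return u @ v_t
```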