Add functor Cholesky. #1138
Conversation
This definitely needs a test, and there is prior precedent in other libraries (torch does it, for example), but we might want the inverse too for performance anyway.
Could you please clarify? This doesn't really add any new functionality, and I don't know what you are referring to when you mention the inverse here. I am happy to write a test, but I'm not sure what a test for the above might actually look like, given its extremely limited functionality.
One simple one would be to make a […]
Okay, that sounds good, let me add this.
Since differentiation is already tested in Zygote, I added a simpler test which checks whether or not […]
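(A minimal sketch of the kind of round-trip check being discussed; the matrix, names, and assertions here are illustrative and not the PR's actual test.)

```julia
using Flux, LinearAlgebra, Test

# Assumes `@functor Cholesky` from this PR is in effect, so `gpu`/`cpu`
# can map over the `factors` field. Without a GPU, `gpu` is a no-op.
C = cholesky([4.0 1.0; 1.0 3.0])   # a small, exactly symmetric SPD matrix
C2 = C |> gpu |> cpu
@test C2 isa Cholesky
@test C2.U ≈ C.U                   # the factor survives the round trip
```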
@dhairyagandhi96 Any further feedback?
Bors r-
1138: Add functor Cholesky. r=dhairyagandhi96 a=aterenin

This adds `@functor Cholesky` so that `cpu` and `gpu` work with `Cholesky` structs. I have no idea what the right place to put this is - if there's somewhere better, please let me know.

Co-authored-by: aterenin <[email protected]>
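(For context, a rough sketch of what the one-line change enables, assuming Flux's `@functor` macro and the `cpu`/`gpu` helpers; where exactly the definition should live is the open question above.)

```julia
using Flux, LinearAlgebra

Flux.@functor Cholesky      # the addition this PR proposes

C = cholesky([2.0 1.0; 1.0 2.0])
C_gpu = gpu(C)              # moves C.factors to the GPU when CUDA is available
C_cpu = cpu(C_gpu)          # and brings it back
```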
Canceled.
One thing: by functoring Cholesky, we basically state that Zygote can look into its fields and assume how they may be updated. Is there anything we need to handle specifically about how the factors need to be updated once we have the gradients?
Zygote shouldn't do anything with the BLAS parameters. If there's a simple way to ensure that the diagonal is positive after gradients are applied (for instance, by taking its absolute value afterwards), this could be considered. As an alternative, we could also make […]
Basically, it would lead to method errors like […]
Thanks. Of the three solutions above, setting the diagonal to its absolute value after gradients are applied seems most sensible, and is also what TensorFlow/GPflow does IIRC. I'll add this now. Finally: […]
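(For illustration only, one way the absolute-value trick could be applied after an optimiser step; `project_positive_diagonal!` is a hypothetical helper, not part of this PR.)

```julia
using LinearAlgebra

# Flip any negative entries on the factor's diagonal back to positive,
# leaving the rest of the factor untouched.
function project_positive_diagonal!(C::Cholesky)
    F = C.factors
    for i in axes(F, 1)
        F[i, i] = abs(F[i, i])
    end
    return C
end
```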
Right, handling the factorization is the exact point I am interested in, since we don't want to update the fields without making sure the diagonal remains non-negative. I'll ping @willtebbutt so he is in the loop.
Sorry about the delay, and thank you for understanding. We don't want to do the incorrect thing here, so I am sure you understand us going over the finer details.
On second thought, we might be able to get away with it, since we already have the adjoints in Zygote. The only thing would be to ensure the non-negative diagonals, I think.
No problem! Being able to do this actually simplifies SparseGaussianProcesses.jl, since I can just put the trainable parameters inside the Cholesky, which is a plus for me. In principle, one could try to get away with just letting the factorization be negative, since a factorization with negative diagonal values will be re-assembled into the same matrix as one with non-negative diagonal values, but this seems frightening, since the array might be used for triangular solves and other BLAS calls which will expect them to be non-negative. Ensuring non-negative diagonals is a bit tricky. We can't just overload […]. Another alternative is marking […]. Let me know what the best approach here is.
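(A sketch of the "mark it non-trainable" alternative floated above, assuming `Flux.trainable` as the opt-out hook; whether the PR uses exactly this hook is not settled here.)

```julia
using Flux, LinearAlgebra

# Keep Cholesky functored so cpu/gpu can move its arrays, but tell the
# optimiser machinery that none of its fields are trainable parameters.
Flux.trainable(::Cholesky) = ()
```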
I can see why you would want something like this, but I'm not sure it's a great idea, and it could cause weird numerical headaches if you e.g. just take the absolute value of the diagonal after updating to ensure positivity of the diagonal elements, and it really would be problematic if there are negative values on the diagonal, because things like […]. Is there a reason this needs to be baked into Flux, and can't just be handled via a downstream package with a regular constrained matrix and a call to […]?
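(A sketch of the downstream alternative described here: keep an unconstrained parameter matrix and rebuild a valid factor inside the model. All names below, e.g. `TriangularParam` and `factor`, are illustrative.)

```julia
using Flux, LinearAlgebra

# Train an unconstrained square matrix U; map it to a lower-triangular
# factor with a strictly positive diagonal whenever the factor is needed.
struct TriangularParam{M<:AbstractMatrix}
    U::M
end
Flux.@functor TriangularParam

factor(p::TriangularParam) =
    LowerTriangular(tril(p.U, -1) + Diagonal(softplus.(diag(p.U))))
```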
Could make […]
What kind of numerical headaches would this cause? To my knowledge, an analogous trick is already done in TensorFlow/GPflow, seemingly without much trouble: there, they take the absolute value of the diagonal. More generally, I think […]
Ah nice, so this functionality is being split into a component package. Should this PR wait until that functionality is completed?
We don't need to worry about that package for this; it's nicer tooling, but we could imitate that with our existing structure just fine too.
Well, generally I just assume that if you introduce discontinuities into your optimisation surface, things go wrong. If you're in the stochastic setting then you'll probably never see them, but if you're doing something deterministic then they tend to show up.
This I completely agree with, because there's no issue regarding the constraints. I generally agree regarding ease of use, but Flux just doesn't have a good answer to trainables with constraints at the minute (unless I've missed something, @dhairyagandhi96?), and baking in stuff that people can't easily opt out of, unless it's really the only sensible thing you can think to do, feels like the wrong move to me.
Sorry if I'm missing something, but I don't quite see how this introduces discontinuities, because I don't see how it modifies the loss surface at all. The proposed solution here is: if we have a gradient update which results in a negative diagonal, i.e. hits the boundary of the loss surface, we simply "reflect" back onto the loss surface by ensuring the update is positive, so the boundary isn't violated. This choice definitely isn't canonical, but it seems fairly reasonable unless I've missed some subtlety. If we do want to keep the current behavior, we can make […]
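(The "reflection" described above, written out as a stand-alone update rule for illustration; `reflected_sgd_step!` is hypothetical and not Flux API.)

```julia
# A vanilla SGD step on the factor, followed by reflecting the diagonal
# back into the positive orthant wherever the step made it negative.
function reflected_sgd_step!(F::AbstractMatrix, g::AbstractMatrix, η)
    F .-= η .* g
    for i in axes(F, 1)
        F[i, i] = abs(F[i, i])
    end
    return F
end
```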
Hmm yeah, my argument was very hand-wavy. From the perspective you present, though, you'll be modifying whichever gradient-based optimisation algorithm is run by the user. E.g. if you're running vanilla SGD, you've now introduced the reflection you discuss, but only under very specific conditions; it's very much an edge case. I agree that there are probably occasions on which that's what you want to do, and this might be a choice that a GP package would want to make, but I can't think of a good reason why this is something that Flux should be baking in. In short, it would take whichever gradient-based optimisation algorithm the user wanted to run and replace it with something else, which wasn't the thing they thought they were running.
That's correct. The tradeoff here is either (a) one makes this choice for the end user by slightly modifying their optimizer to avoid an ill-defined edge case, or (b) one requires the user to make the choice themselves. I'm happy with either one, though I perhaps slightly prefer (a). The current PR implements (b) by setting […]
OK, so it's reasonable to go along with this for the time being. @aterenin, could you rebase?
Done!
bors r+
Build succeeded.