Add functor Cholesky. #1138

Merged · 5 commits · Jul 2, 2020

Conversation

aterenin
Contributor

This adds `@functor Cholesky` so that `cpu` and `gpu` work with `Cholesky` structs.

I have no idea what the right place to put this is - if there's somewhere better, please let me know.
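
For reference, the change itself is tiny; a minimal sketch of the declaration (not necessarily the exact diff in this PR) would be:

```julia
using Flux, LinearAlgebra

# Declaring Cholesky as a functor lets fmap-based utilities such as
# `cpu` and `gpu` recurse into its fields and rebuild the struct afterwards.
Flux.@functor Cholesky
```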

@DhairyaLGandhi
Member

This definitely needs a test, and there's prior precedent in other libraries (torch does it, for example), but we might want the inverse too for performance anyway.

@aterenin
Contributor Author

> This definitely needs a test, and there's prior precedent in other libraries (torch does it, for example), but we might want the inverse too for performance anyway.

Could you please clarify? This doesn't add any kind of new functionality to cholesky, which is already well-supported in Zygote and CuArrays. All it does is ensure that an already-computed factorization can move successfully from CPU to GPU and back, so that user-defined structs that cache a Cholesky struct as part of their internals are GPU-friendly.

I don't know what you are referring to when you mention the inverse here. I am happy to write a test but am not sure what a test for the above might actually look like given its extremely limited functionality.

@DhairyaLGandhi
Member

One simple test would be to make a struct with a field that holds a Cholesky, move it to the GPU, and differentiate it.

@aterenin
Contributor Author

Okay, that sounds good, let me add this.

@aterenin
Contributor Author

Since differentiation is already tested in Zygote, I added a simpler test which checks that `gpu` and `cpu` move data around correctly, which is the focus of this PR. Let me know if this is OK!
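
Roughly, the kind of round-trip check I have in mind looks like the sketch below (names and exact assertions are illustrative, not necessarily what ended up in the PR):

```julia
using Flux, LinearAlgebra, Test

@testset "Cholesky moves between cpu and gpu" begin
    A = rand(Float32, 3, 3)
    C = cholesky(A * A' + I)      # a valid positive-definite factorization
    C2 = C |> gpu |> cpu          # round-trip through the device utilities
    @test C2 isa Cholesky
    @test C2.factors ≈ C.factors
    @test C2.uplo == C.uplo
end
```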

@aterenin
Contributor Author

aterenin commented May 4, 2020

@dhairyagandhi96 Any further feedback?

@DhairyaLGandhi
Member

Bors r-

bors bot added a commit that referenced this pull request May 4, 2020
1138: Add functor Cholesky. r=dhairyagandhi96 a=aterenin

This adds `@functor Cholesky` so that `cpu` and `gpu` work with `Cholesky` structs.

I have no idea what the right place to put this is - if there's somewhere better, please let me know.

Co-authored-by: aterenin <[email protected]>
@bors
Contributor

bors bot commented May 4, 2020

Canceled.

@DhairyaLGandhi
Member

One thing: by functoring Cholesky, we basically state that Zygote can look into its fields and assume how they may be updated.

Is there anything we need to handle specifically about how the factors need to be updated once we have the gradients?

@aterenin
Contributor Author

aterenin commented May 4, 2020

> One thing: by functoring Cholesky, we basically state that Zygote can look into its fields and assume how they may be updated.
>
> Is there anything we need to handle specifically about how the factors need to be updated once we have the gradients?

Cholesky is defined here: https://github.com/JuliaLang/julia/blob/master/stdlib/LinearAlgebra/src/cholesky.jl

Zygote shouldn't do anything with the BLAS parameters info and uplo, since Cholesky is immutable. The other field of the struct is factors, which is the underlying array in which the factorization is stored. The diagonal of this array should be non-negative to ensure positive-semi-definiteness. I've been handling this in my own code by storing the array and diagonal separately, and exponentiating the diagonal - but in my code, the field containing a Cholesky is marked as non-trainable in the struct that stores it.
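
For illustration, the parametrisation I use in my own code looks roughly like this (the helper name is hypothetical):

```julia
using LinearAlgebra

# Hypothetical helper: keep an unconstrained strictly-lower part and a
# log-diagonal, and assemble a factor whose diagonal is strictly positive.
function assemble_factor(lower::AbstractMatrix, logdiag::AbstractVector)
    LowerTriangular(tril(lower, -1) + Diagonal(exp.(logdiag)))
end
```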

If there's a simple way to ensure that the diagonal is positive after gradients are applied (for instance, by taking its absolute value afterwards), this could be considered. As an alternative, we could also make Cholesky non-trainable, or simply do nothing and allow the possibility of ill-defined factorizations.
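
The absolute-value option would amount to something like this hypothetical helper, applied to the factors array after an optimiser step:

```julia
# Hypothetical helper: force the diagonal of the factors array to be
# non-negative after an in-place optimiser update.
function abs_diagonal!(factors::AbstractMatrix)
    for i in axes(factors, 1)
        factors[i, i] = abs(factors[i, i])
    end
    return factors
end
```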

@DhairyaLGandhi
Member

Basically it would lead to method errors like update!(opt, Cholesky, Array)

@aterenin
Contributor Author

aterenin commented May 4, 2020

> Basically it would lead to method errors like update!(opt, Cholesky, Array)

Thanks. Of the three solutions above, setting the diagonal to its absolute value after gradients are applied seems most sensible, and is also what TensorFlow/GPFlow does IIRC. I'll add this now.

Finally: getproperty for Cholesky is overloaded by LinearAlgebra to provide some convenience methods. If there is anything special that needs to be done in Zygote because of this, please let me know!

@DhairyaLGandhi
Member

DhairyaLGandhi commented May 4, 2020

Right, handling the factorization is the exact point I am interested in, since we don't want to update the fields without making sure the diagonal remains non-negative.

I'll ping @willtebbutt so he is in the loop.

@DhairyaLGandhi
Member

Sorry about the delay, and thank you for understanding. We don't want to do the incorrect thing here, so I'm sure you understand us going over the finer details.

@DhairyaLGandhi
Member

On second thought, we might be able to get away with it since we already have the adjoints in Zygote. The only thing would be to ensure the non-negative diagonals, I think.

@aterenin
Contributor Author

aterenin commented May 4, 2020

> Sorry about the delay, and thank you for understanding. We don't want to do the incorrect thing here, so I'm sure you understand us going over the finer details.

No problem! Being able to do this actually simplifies SparseGaussianProcesses.jl since I can just put the trainable parameters inside the Cholesky, which is a plus for me.

In principle, one could try to get away with just letting the diagonal of the factorization be negative, since a factorization with negative diagonal values will be re-assembled into the same matrix as one with non-negative diagonal values, but this seems frightening, since the array might be used for triangular solves and other BLAS calls which expect the diagonal to be non-negative.

Ensuring non-negative diagonals is a bit tricky. We can't just overload update! because it won't be called with update!(opt, c::Cholesky, x) but rather update!(opt, p::AbstractMatrix, x) where p is c.factors, so by the time it is called, the type system no longer knows that the array is contained inside the Cholesky. I would welcome suggestions on how to address this, since I don't think I entirely understand what @functor and related machinery is doing under the hood.

Another alternative is marking factors as non-trainable.

Let me know what the best approach is here.

@willtebbutt
Member

I can see why you would want something like this, but I'm not sure it's a great idea: it could cause weird numerical headaches if you e.g. just take the absolute value of the diagonal after updating to ensure positivity of the diagonal elements, and it really would be problematic if there are negative values on the diagonal, because things like logdet won't work.

Is there a reason this needs to be baked into Flux, and can't just be handled via a downstream package with a regular constrained matrix, and a call to Cholesky?

@DhairyaLGandhi
Member

We could mark factors as non-trainable, but if we can get away with hooking into Zygote to perform optimisations that respect the structures more generally (see FluxML/Functors.jl#1), we should be able to handle this more gracefully by just adding a hook into the update rule. The current system is great too, since it almost bypasses the type completely. So we may not need to worry about this much.

@aterenin
Contributor Author

aterenin commented May 4, 2020

> I can see why you would want something like this, but I'm not sure it's a great idea: it could cause weird numerical headaches if you e.g. just take the absolute value of the diagonal after updating to ensure positivity of the diagonal elements, and it really would be problematic if there are negative values on the diagonal, because things like logdet won't work.
>
> Is there a reason this needs to be baked into Flux, and can't just be handled via a downstream package with a regular constrained matrix, and a call to Cholesky?

What kind of numerical headaches would this cause? To my knowledge, an analogous trick is already done in TensorFlow/GPFlow seemingly without much trouble - there, they take the absolute value of the diagonal.

More generally, I think @functor UpperTriangular and the like should also be added, but that is a matter for a different PR. Note that this isn't just about making these structs trainable, but also allowing them to work with cpu and gpu seamlessly, along with other methods.
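
(For those wrapper types, the hypothetical follow-up would just be declarations like the sketch below, out of scope for this PR:)

```julia
using Flux, LinearAlgebra

# Hypothetical follow-up declarations, so these wrappers also move with cpu/gpu.
Flux.@functor UpperTriangular
Flux.@functor LowerTriangular
```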

The logdet example shows that figuring out how to handle diagonal values is critical for this approach to be sensible.

> We could mark factors as non-trainable, but if we can get away with hooking into Zygote to perform optimisations that respect the structures more generally (see FluxML/Functors.jl#1), we should be able to handle this more gracefully by just adding a hook into the update rule. The current system is great too, since it almost bypasses the type completely. So we may not need to worry about this much.

Ah nice, so this functionality is being split into a component package. Should this PR wait until that functionality is completed?

@DhairyaLGandhi
Member

We don't need to worry about that package for this; it's nicer tooling, but we could imitate it with our existing structure just fine too.

@willtebbutt
Member

> What kind of numerical headaches would this cause? To my knowledge, an analogous trick is already done in TensorFlow/GPFlow seemingly without much trouble - there, they take the absolute value of the diagonal.

Well generally I just assume that if you introduce discontinuities into your optimisation surface, things go wrong. If you're in the stochastic setting then you'll probably never see them, but if you're doing something deterministic then they tend to show up.

> More generally, I think @functor UpperTriangular and the like should also be added, but that is a matter for a different PR.

This I completely agree with because there's no issue regarding the constraints.

I generally agree regarding ease of use, but Flux just doesn't have a good answer to trainables with constraints at the minute (unless I've missed something, @dhairyagandhi96?), and baking in stuff that people can't easily opt out of, unless it's really the only sensible thing you can think to do, feels like the wrong move to me.

@aterenin
Contributor Author

aterenin commented May 4, 2020

>> What kind of numerical headaches would this cause? To my knowledge, an analogous trick is already done in TensorFlow/GPFlow seemingly without much trouble - there, they take the absolute value of the diagonal.
>
> Well generally I just assume that if you introduce discontinuities into your optimisation surface, things go wrong. If you're in the stochastic setting then you'll probably never see them, but if you're doing something deterministic then they tend to show up.
>
>> More generally, I think @functor UpperTriangular and the like should also be added, but that is a matter for a different PR.
>
> This I completely agree with because there's no issue regarding the constraints.
>
> I generally agree regarding ease of use, but Flux just doesn't have a good answer to trainables with constraints at the minute (unless I've missed something, @dhairyagandhi96?), and baking in stuff that people can't easily opt out of, unless it's really the only sensible thing you can think to do, feels like the wrong move to me.

Sorry if I'm missing something, but I don't quite see how this introduces discontinuities, because I don't see how it modifies the loss surface at all. The proposed solution here is: if we have a gradient update which results in a negative diagonal, i.e. hits the boundary of the loss surface, we simply "reflect" back onto the loss surface by ensuring the update is positive, so the boundary isn't violated. This choice definitely isn't canonical, but seems fairly reasonable unless I've missed some subtlety.

If we do want to keep current behavior, we can make Cholesky non-trainable. I've (perhaps temporarily) added this to my PR, to replicate current behavior while also allowing gpu and cpu to work correctly.
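
Concretely, the non-trainable variant amounts to roughly the following sketch (using Flux's trainable overload; the exact code in the PR may differ):

```julia
using Flux, LinearAlgebra

Flux.@functor Cholesky            # cpu/gpu can move the struct's fields...
Flux.trainable(::Cholesky) = ()   # ...but optimisers see no trainable params
```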

@willtebbutt
Member

> Sorry if I'm missing something, but I don't quite see how this introduces discontinuities, because I don't see how it modifies the loss surface at all. The proposed solution here is: if we have a gradient update which results in a negative diagonal, i.e. hits the boundary of the loss surface, we simply "reflect" back onto the loss surface by ensuring the update is positive, so the boundary isn't violated. This choice definitely isn't canonical, but seems fairly reasonable unless I've missed some subtlety.

Hmm yeah, my argument was very hand-wavy. From the perspective you present, though, you'll be modifying whichever gradient-based optimisation algorithm is run by the user. E.g. if you're running vanilla SGD, you've now introduced the reflection you discuss, but only under very specific conditions -- it's very much an edge case.

I agree that there are probably occasions on which that's what you want to do, and this might be a choice that a GP package would want to make, but I can't think of a good reason why this is something that Flux should be baking in.

In short, it would take whichever gradient-based optimisation algorithm the user wanted to run and replace it with something else, which wasn't the thing they thought they were running.

@aterenin
Contributor Author

aterenin commented May 4, 2020

>> Sorry if I'm missing something, but I don't quite see how this introduces discontinuities, because I don't see how it modifies the loss surface at all. The proposed solution here is: if we have a gradient update which results in a negative diagonal, i.e. hits the boundary of the loss surface, we simply "reflect" back onto the loss surface by ensuring the update is positive, so the boundary isn't violated. This choice definitely isn't canonical, but seems fairly reasonable unless I've missed some subtlety.
>
> Hmm yeah, my argument was very hand-wavy. From the perspective you present, though, you'll be modifying whichever gradient-based optimisation algorithm is run by the user. E.g. if you're running vanilla SGD, you've now introduced the reflection you discuss, but only under very specific conditions -- it's very much an edge case.
>
> I agree that there are probably occasions on which that's what you want to do, and this might be a choice that a GP package would want to make, but I can't think of a good reason why this is something that Flux should be baking in.
>
> In short, it would take whichever gradient-based optimisation algorithm the user wanted to run and replace it with something else, which wasn't the thing they thought they were running.

That's correct. The tradeoff here is either (a) one makes this choice for the end user by slightly modifying their optimizer to avoid an ill-defined edge case, or (b) one requires the user to make the choice themselves.

I'm happy with either one, though I perhaps slightly prefer (a). The current PR implements (b) by setting Cholesky to be non-trainable (which it already is by default). I am happy to modify the PR to implement (a) if this is what Flux.jl wants, but don't see a way to do it (at least without something like FluxML/Functors.jl#1). Perhaps it makes sense to implement (b) in the meantime and revisit once that gets added?

@CarloLucibello
Member

ok, so it's reasonable to go along with this for the time being. @aterenin could you rebase?

@aterenin
Contributor Author

aterenin commented Jul 2, 2020

> ok, so it's reasonable to go along with this for the time being. @aterenin could you rebase?

Done!

@CarloLucibello
Member

bors r+

@bors
Contributor

bors bot commented Jul 2, 2020

Build succeeded:
