Add functor Cholesky. #1138

Merged · 5 commits · Jul 2, 2020

Conversation

aterenin
Contributor

This adds `@functor Cholesky` so that `cpu` and `gpu` work with `Cholesky` structs.

I have no idea what the right place to put this is - if there's somewhere better, please let me know.
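
For reference, the change itself is tiny; a minimal sketch of the declaration (not necessarily the exact diff in this PR) would be:

```julia
using Flux, LinearAlgebra

# Declaring Cholesky as a functor lets fmap-based utilities such as
# `cpu` and `gpu` recurse into its fields and rebuild the struct afterwards.
Flux.@functor Cholesky
```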

@DhairyaLGandhi
Member

This definitely needs a test, and there's prior precedent in other libraries (torch does it, for example), but we might want the inverse too for performance anyway.

@aterenin
Contributor Author

> This definitely needs a test, and there's prior precedent in other libraries (torch does it, for example), but we might want the inverse too for performance anyway.

Could you please clarify? This doesn't add any kind of new functionality to cholesky, which is already well-supported in Zygote and CuArrays. All it does is ensure that an already-computed factorization can move successfully from CPU to GPU and back, so that user-defined structs that cache a Cholesky struct as part of their internals are GPU-friendly.

I don't know what you are referring to when you mention the inverse here. I am happy to write a test but am not sure what a test for the above might actually look like given its extremely limited functionality.

@DhairyaLGandhi
Member

One simple test would be to make a struct with a field that holds a Cholesky, move it to the GPU, and differentiate it.

@aterenin
Contributor Author

Okay, that sounds good, let me add this.

@aterenin
Contributor Author

Since differentiation is already tested in Zygote, I added a simpler test which checks that `gpu` and `cpu` move data around correctly, which is the focus of this PR. Let me know if this is OK!
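
Roughly, the kind of round-trip check I have in mind looks like the sketch below (names and exact assertions are illustrative, not necessarily what ended up in the PR):

```julia
using Flux, LinearAlgebra, Test

@testset "Cholesky moves between cpu and gpu" begin
    A = rand(Float32, 3, 3)
    C = cholesky(A * A' + I)      # a valid positive-definite factorization
    C2 = C |> gpu |> cpu          # round-trip through the device utilities
    @test C2 isa Cholesky
    @test C2.factors ≈ C.factors
    @test C2.uplo == C.uplo
end
```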

@aterenin
Contributor Author

aterenin commented May 4, 2020

@dhairyagandhi96 Any further feedback?

@DhairyaLGandhi
Member

Bors r-

bors bot added a commit that referenced this pull request May 4, 2020
1138: Add functor Cholesky. r=dhairyagandhi96 a=aterenin

This adds `@functor Cholesky` so that `cpu` and `gpu` work with `Cholesky` structs.

I have no idea what the right place to put this is - if there's somewhere better, please let me know.

Co-authored-by: aterenin <[email protected]>
@bors
Contributor

bors bot commented May 4, 2020

Canceled.

@DhairyaLGandhi
Member

One thing: by functoring Cholesky, we basically state that Zygote can look into its fields and assume how they may be updated.

Is there anything we need to handle specifically about how the factors need to be updated once we have the gradients?

@aterenin
Contributor Author

aterenin commented May 4, 2020

> One thing: by functoring Cholesky, we basically state that Zygote can look into its fields and assume how they may be updated.
>
> Is there anything we need to handle specifically about how the factors need to be updated once we have the gradients?

Cholesky is defined here: https://github.com/JuliaLang/julia/blob/master/stdlib/LinearAlgebra/src/cholesky.jl

Zygote shouldn't do anything with the BLAS parameters info and uplo, since Cholesky is immutable. The other field of the struct is factors, which is the underlying array in which the factorization is stored. The diagonal of this array should be non-negative to ensure positive-semi-definiteness. I've been handling this in my own code by storing the array and diagonal separately, and exponentiating the diagonal - but in my code, the field containing a Cholesky is marked as non-trainable in the struct that stores it.
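
For illustration, the parametrisation I use in my own code looks roughly like this (the helper name is hypothetical):

```julia
using LinearAlgebra

# Hypothetical helper: keep an unconstrained strictly-lower part and a
# log-diagonal, and assemble a factor whose diagonal is strictly positive.
function assemble_factor(lower::AbstractMatrix, logdiag::AbstractVector)
    LowerTriangular(tril(lower, -1) + Diagonal(exp.(logdiag)))
end
```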

If there's a simple way to ensure that the diagonal is positive after gradients are applied (for instance, by taking its absolute value afterwards), this could be considered. As an alternative, we could also make Cholesky non-trainable, or simply do nothing and allow the possibility of ill-defined factorizations.
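
The absolute-value option would amount to something like this hypothetical helper, applied to the factors array after an optimiser step:

```julia
# Hypothetical helper: force the diagonal of the factors array to be
# non-negative after an in-place optimiser update.
function abs_diagonal!(factors::AbstractMatrix)
    for i in axes(factors, 1)
        factors[i, i] = abs(factors[i, i])
    end
    return factors
end
```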

@DhairyaLGandhi
Member

Basically it would lead to method errors like update!(opt, Cholesky, Array)

@aterenin
Contributor Author

aterenin commented May 4, 2020

> Basically it would lead to method errors like update!(opt, Cholesky, Array)

Thanks. Of the three solutions above, setting the diagonal to its absolute value after gradients are applied seems most sensible, and is also what TensorFlow/GPFlow does IIRC. I'll add this now.

Finally: getproperty for Cholesky is overloaded by LinearAlgebra to provide some convenience methods. If there is anything special that needs to be done in Zygote because of this, please let me know!

@DhairyaLGandhi
Member

DhairyaLGandhi commented May 4, 2020

Right, handling the factorization is the exact point I am interested in, since we don't want to update the fields without making sure the diagonal remains non-negative.

I'll ping @willtebbutt so he is in the loop.

@DhairyaLGandhi
Member

Sorry about the delay, and thank you for understanding. We don't want to do the incorrect thing here, so I'm sure you understand us going over the finer details.

@DhairyaLGandhi
Member

On second thought, we might be able to get away with it since we already have the adjoints in Zygote. The only thing would be to ensure the non-negative diagonals, I think.

@aterenin
Contributor Author

aterenin commented May 4, 2020

> Sorry about the delay, and thank you for understanding. We don't want to do the incorrect thing here, so I'm sure you understand us going over the finer details.

No problem! Being able to do this actually simplifies SparseGaussianProcesses.jl since I can just put the trainable parameters inside the Cholesky, which is a plus for me.

In principle, one could try to get away with just letting the diagonal of the factorization be negative, since a factorization with negative diagonal values will be re-assembled into the same matrix as one with non-negative diagonal values, but this seems frightening, since the array might be used for triangular solves and other BLAS calls which expect the diagonal to be non-negative.

Ensuring non-negative diagonals is a bit tricky. We can't just overload update! because it won't be called with update!(opt, c::Cholesky, x) but rather update!(opt, p::AbstractMatrix, x) where p is c.factors, so by the time it is called, the type system no longer knows that the array is contained inside the Cholesky. I would welcome suggestions on how to address this, since I don't think I entirely understand what @functor and related machinery is doing under the hood.

Another alternative is marking factors as non-trainable.

Let me know what the best approach is here.

@willtebbutt
Member

I can see why you would want something like this, but I'm not sure it's a great idea: it could cause weird numerical headaches if you e.g. just take the absolute value of the diagonal after updating to ensure positivity of the diagonal elements, and it really would be problematic if there are negative values on the diagonal, because things like logdet won't work.

Is there a reason this needs to be baked into Flux, and can't just be handled via a downstream package with a regular constrained matrix, and a call to Cholesky?

@DhairyaLGandhi
Member

We could mark factors as non-trainable, but if we can get away with hooking into Zygote to perform optimisations that respect the structures more generally (see FluxML/Functors.jl#1), we should be able to handle this more gracefully by just adding a hook into the update rule. The current system is great too, since it almost bypasses the type completely. So we may not need to worry about this much.

@aterenin
Contributor Author

aterenin commented May 4, 2020

> I can see why you would want something like this, but I'm not sure it's a great idea: it could cause weird numerical headaches if you e.g. just take the absolute value of the diagonal after updating to ensure positivity of the diagonal elements, and it really would be problematic if there are negative values on the diagonal, because things like logdet won't work.
>
> Is there a reason this needs to be baked into Flux, and can't just be handled via a downstream package with a regular constrained matrix, and a call to Cholesky?

What kind of numerical headaches would this cause? To my knowledge, an analogous trick is already done in TensorFlow/GPFlow seemingly without much trouble - there, they take the absolute value of the diagonal.

More generally, I think @functor UpperTriangular and the like should also be added, but that is a matter for a different PR. Note that this isn't just about making these structs trainable, but also allowing them to work with cpu and gpu seamlessly, along with other methods.
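
(For those wrapper types, the hypothetical follow-up would just be declarations like the sketch below, out of scope for this PR:)

```julia
using Flux, LinearAlgebra

# Hypothetical follow-up declarations, so these wrappers also move with cpu/gpu.
Flux.@functor UpperTriangular
Flux.@functor LowerTriangular
```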

The logdet example shows that figuring out how to handle diagonal values is critical for this approach to be sensible.

> We could mark factors as non-trainable, but if we can get away with hooking into Zygote to perform optimisations that respect the structures more generally (see FluxML/Functors.jl#1), we should be able to handle this more gracefully by just adding a hook into the update rule. The current system is great too, since it almost bypasses the type completely. So we may not need to worry about this much.

Ah nice, so this functionality is being split into a component package. Should this PR wait until that functionality is completed?

@DhairyaLGandhi
Member

We don't need to worry about that package for this; it's nicer tooling, but we could imitate it with our existing structure just fine too.

@willtebbutt
Member

> What kind of numerical headaches would this cause? To my knowledge, an analogous trick is already done in TensorFlow/GPFlow seemingly without much trouble - there, they take the absolute value of the diagonal.

Well generally I just assume that if you introduce discontinuities into your optimisation surface, things go wrong. If you're in the stochastic setting then you'll probably never see them, but if you're doing something deterministic then they tend to show up.

> More generally, I think @functor UpperTriangular and the like should also be added, but that is a matter for a different PR.

This I completely agree with because there's no issue regarding the constraints.

I generally agree regarding ease of use, but Flux just doesn't have a good answer to trainables with constraints at the minute (unless I've missed something, @dhairyagandhi96?), and baking in stuff that people can't easily opt out of, unless it's really the only sensible thing you can think to do, feels like the wrong move to me.

@aterenin
Contributor Author

aterenin commented May 4, 2020

>> What kind of numerical headaches would this cause? To my knowledge, an analogous trick is already done in TensorFlow/GPFlow seemingly without much trouble - there, they take the absolute value of the diagonal.
>
> Well generally I just assume that if you introduce discontinuities into your optimisation surface, things go wrong. If you're in the stochastic setting then you'll probably never see them, but if you're doing something deterministic then they tend to show up.
>
>> More generally, I think @functor UpperTriangular and the like should also be added, but that is a matter for a different PR.
>
> This I completely agree with because there's no issue regarding the constraints.
>
> I generally agree regarding ease of use, but Flux just doesn't have a good answer to trainables with constraints at the minute (unless I've missed something, @dhairyagandhi96?), and baking in stuff that people can't easily opt out of, unless it's really the only sensible thing you can think to do, feels like the wrong move to me.

Sorry if I'm missing something, but I don't quite see how this introduces discontinuities, because I don't see how it modifies the loss surface at all. The proposed solution here is: if we have a gradient update which results in a negative diagonal, i.e. hits the boundary of the loss surface, we simply "reflect" back onto the loss surface by ensuring the update is positive, so the boundary isn't violated. This choice definitely isn't canonical, but seems fairly reasonable unless I've missed some subtlety.

If we do want to keep current behavior, we can make Cholesky non-trainable. I've (perhaps temporarily) added this to my PR, to replicate current behavior while also allowing gpu and cpu to work correctly.
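
Concretely, the non-trainable variant amounts to roughly the following sketch (using Flux's trainable overload; the exact code in the PR may differ):

```julia
using Flux, LinearAlgebra

Flux.@functor Cholesky            # cpu/gpu can move the struct's fields...
Flux.trainable(::Cholesky) = ()   # ...but optimisers see no trainable params
```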

@willtebbutt
Member

> Sorry if I'm missing something, but I don't quite see how this introduces discontinuities, because I don't see how it modifies the loss surface at all. The proposed solution here is: if we have a gradient update which results in a negative diagonal, i.e. hits the boundary of the loss surface, we simply "reflect" back onto the loss surface by ensuring the update is positive, so the boundary isn't violated. This choice definitely isn't canonical, but seems fairly reasonable unless I've missed some subtlety.

Hmm yeah, my argument was very hand-wavy. From the perspective you present, though, you'll be modifying whichever gradient-based optimisation algorithm is run by the user. E.g. if you're running vanilla SGD, you've now introduced the reflection you discuss, but only under very specific conditions -- it's very much an edge case.

I agree that there are probably occasions on which that's what you want to do, and this might be a choice that a GP package would want to make, but I can't think of a good reason why this is something that Flux should be baking in.

In short, it would take whichever gradient-based optimisation algorithm the user wanted to run and replace it with something else, which wasn't the thing they thought they were running.

@aterenin
Contributor Author

aterenin commented May 4, 2020

>> Sorry if I'm missing something, but I don't quite see how this introduces discontinuities, because I don't see how it modifies the loss surface at all. The proposed solution here is: if we have a gradient update which results in a negative diagonal, i.e. hits the boundary of the loss surface, we simply "reflect" back onto the loss surface by ensuring the update is positive, so the boundary isn't violated. This choice definitely isn't canonical, but seems fairly reasonable unless I've missed some subtlety.
>
> Hmm yeah, my argument was very hand-wavy. From the perspective you present, though, you'll be modifying whichever gradient-based optimisation algorithm is run by the user. E.g. if you're running vanilla SGD, you've now introduced the reflection you discuss, but only under very specific conditions -- it's very much an edge case.
>
> I agree that there are probably occasions on which that's what you want to do, and this might be a choice that a GP package would want to make, but I can't think of a good reason why this is something that Flux should be baking in.
>
> In short, it would take whichever gradient-based optimisation algorithm the user wanted to run and replace it with something else, which wasn't the thing they thought they were running.

That's correct. The tradeoff here is either (a) one makes this choice for the end user by slightly modifying their optimizer to avoid an ill-defined edge case, or (b) one requires the user to make the choice themselves.

I'm happy with either one, though I perhaps slightly prefer (a). The current PR implements (b) by setting Cholesky to be non-trainable (which it already is by default). I am happy to modify the PR to implement (a) if this is what Flux.jl wants, but don't see a way to do it (at least without something like FluxML/Functors.jl#1). Perhaps it makes sense to implement (b) in the meantime and revisit once that gets added?

@CarloLucibello
Member

ok, so it's reasonable to go along with this for the time being. @aterenin could you rebase?

@aterenin
Contributor Author

aterenin commented Jul 2, 2020

> ok, so it's reasonable to go along with this for the time being. @aterenin could you rebase?

Done!

@CarloLucibello
Member

bors r+

@bors
Contributor

bors bot commented Jul 2, 2020

Build succeeded:
