
Dense instead of sparse matrix returned during differentiation #1537

Open
vboussange opened this issue Nov 3, 2024 · 4 comments
Labels
ChainRules adjoint -> rrule, and further integration CUDA All things GPU

Comments

@vboussange

vboussange commented Nov 3, 2024

Hi there,
I found an inconsistent behaviour when differentiating a function that takes an AbstractCuSparseArray versus the same function applied to an AbstractSparseArray:

using CUDA
using Zygote
using SparseArrays

logpos(a) = a > 0 ? log(a) : zero(a)
l(A) = sum(logpos.(A))

A = sprandn(Float32, 10, 10, 0.4)

dA = gradient(l, A)[1] # returns a sparse array, as expected

Acu = sparse(CuArray(A))
dAcu = gradient(l, Acu)[1] # returns a dense CuArray

Maybe someone here has an idea of where this could come from?

@vboussange vboussange changed the title CuArray instead of SparseCuArray returned during differentiation Dense instead of sparse matrix returned during differentiation Nov 3, 2024
@ToucheSir
Member

I suspect the issue is with missing rule(s) in ChainRules since Zygote has almost nothing in the way of machinery for diffing sparse arrays. What happens if you set CUDA.allowscalar(false) before running the MWE?

@ToucheSir ToucheSir added the CUDA All things GPU label Nov 5, 2024
@vboussange
Author

What happens if you set CUDA.allowscalar(false) before running the MWE?

No error thrown, same dense array returned.

@ToucheSir
Member

Ok, I dug into this a bit more and have a reduced MWE:

julia> Zygote.unbroadcast(A, collect(A)) |> summary
"10×10 SparseMatrixCSC{Float32, Int64} with 41 stored entries"

julia> Zygote.unbroadcast(Acu, cu(collect(Acu))) |> summary
"10×10 CuArray{Float32, 2, CUDA.DeviceMemory}"

unbroadcast is an internal function used by the broadcast rules, and the divergence between CPU and GPU happens here:

_project(x, x̄) # ProjectTo handles reshape, offsets, structured matrices, row vectors

The gist is that the actual gradient computed from the broadcast is dense. However, the CPU path knows how to "project" that back into a sparse gradient, so the end result is sparse. This is done via the projection machinery in ChainRulesCore.jl, which exists for CPU sparse matrices but not GPU ones.
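As an illustration (a CPU-only sketch, assuming ChainRulesCore and SparseArrays are available), the projection step that the CPU path gets for free can be reproduced directly:

```julia
using ChainRulesCore, SparseArrays

A = sprandn(Float32, 10, 10, 0.4)
proj = ProjectTo(A)              # records the sparsity pattern of A

dense_grad = ones(Float32, 10, 10)
sparse_grad = proj(dense_grad)   # dense gradient projected back onto A's pattern

sparse_grad isa SparseMatrixCSC  # true on the CPU; no such projector is defined
                                 # for CuSparseMatrixCSC, hence the dense result
```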

So ultimately, I'd say there are two ways to look at this. One is that returning the sparse matrix isn't saving much and arguably uses more compute. Re-sparsifying the dense gradients yourself afterwards might work well enough. The other interpretation is that this is a missing projection rule in ChainRulesCore, in which case this may be worth a feature request.
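For the first option, a workaround sketch (assuming the `l` and `Acu` variables from the MWE above, and that `sparse` on a dense CuArray is available via CUDA.jl's CUSPARSE support, as used in the original post):

```julia
using CUDA, Zygote

dAcu_dense = gradient(l, Acu)[1]   # dense CuArray, as reported
dAcu_sparse = sparse(dAcu_dense)   # re-sparsify on the GPU

# Caveat: unlike ProjectTo, this keeps any nonzero gradient entries that fall
# outside Acu's original sparsity pattern, and drops stored entries whose
# gradient happens to be exactly zero.
```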

@ToucheSir ToucheSir added the ChainRules adjoint -> rrule, and further integration label Nov 11, 2024
@vboussange
Author

Thanks for digging into this! I may request a feature to ChainRulesCore then. Cheers!
