AD with ForwardDiff is not thread-safe #6

Closed
sethaxen opened this issue Jan 31, 2023 · 10 comments · Fixed by #8

Comments

@sethaxen

When using multiple threads, the gradient computations disagree with the gradients one gets when using a single thread. Here's an MWE:

using ForwardDiff, LinearAlgebra, LogDensityProblems, LogDensityProblemsAD

struct LogDensityFunction{F}
    logp::F
    dim::Int
end
function LogDensityProblems.capabilities(::Type{<:LogDensityFunction})
    return LogDensityProblems.LogDensityOrder{0}()
end
LogDensityProblems.dimension(ℓ::LogDensityFunction) = ℓ.dim
LogDensityProblems.logdensity(ℓ::LogDensityFunction, x) = ℓ.logp(x)

dim = 10
n = 10^6
P = LogDensityFunction(x -> -sum(abs2, x) / 2, dim)
ℓ = ADgradient(:ForwardDiff, P)
x = randn(dim, n)

results_single = zeros(dim, n)
results_threaded = zeros(dim, n)

for j in 1:n
    results_single[:, j] = LogDensityProblems.logdensity_and_gradient(ℓ, x[:, j])[2]
end

Threads.@threads for j in 1:n
    results_threaded[:, j] = LogDensityProblems.logdensity_and_gradient(ℓ, x[:, j])[2]
end

On my machine (using 4 threads), I get the following result:

julia> norm(results_single .- results_threaded)
2588.220046954836

This appears to be the same issue as JuliaDiff/ForwardDiff.jl#573, where the signature ForwardDiff.gradient(f, x, cfg), which is used here, is shown to be not thread-safe. By comparison, ReverseDiff seems to be fine.

@sethaxen
Author

I suspect the issue is with a shared GradientConfig. AbstractDifferentiation creates a GradientConfig in its gradient function, and I don't think it suffers from this issue: https://github.com/JuliaDiff/AbstractDifferentiation.jl/blob/eb5d913b9e4cbd31465af4ee2a75acb7f69ded91/src/forwarddiff.jl#LL47C57-L47C57
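
For reference, here is one way to check the shared-config hypothesis against the MWE above; this is only a sketch (the name results_threaded_safe is illustrative), reusing the variables defined there. Each iteration builds its own ℓ, so no GradientConfig buffers are shared across threads.

results_threaded_safe = zeros(dim, n)

Threads.@threads for j in 1:n
    ℓ_local = ADgradient(:ForwardDiff, P)  # fresh GradientConfig, never shared
    results_threaded_safe[:, j] = LogDensityProblems.logdensity_and_gradient(ℓ_local, x[:, j])[2]
end

norm(results_single .- results_threaded_safe)  # expected to be ≈ 0 if the shared config is the culprit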

@devmotion
Collaborator

IMO that seems to be a ForwardDiff issue? At least it seems somewhat surprising that gradient evaluation mutates the gradient config, but that seems to be what happens here: https://github.com/JuliaDiff/ForwardDiff.jl/blob/4b143a199f541e7c1248dc5bc29b397be938e81d/src/apiutils.jl#L34-L46

@devmotion
Collaborator

Regarding AbstractDifferentiation: if I remember correctly, the only reason it creates a config at all is to set the chunk size. Otherwise it would not be needed.

@sethaxen
Author

At least it seems somewhat surprising that gradient evaluation mutates the gradient config but that seems what happens here: https://github.com/JuliaDiff/ForwardDiff.jl/blob/4b143a199f541e7c1248dc5bc29b397be938e81d/src/apiutils.jl#L34-L46

This is the documented behavior. It seems the main purpose of the config is to allocate work buffers: https://juliadiff.org/ForwardDiff.jl/stable/user/api/#Preallocating/Configuring-Work-Buffers. See also this comment: JuliaDiff/ForwardDiff.jl#573 (comment)
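
For context, a minimal standalone sketch of that documented pattern (names here are illustrative, not from this package): the config preallocates the dual-number work buffers once, and every gradient call that is passed the config reuses, and therefore mutates, those buffers.

using ForwardDiff

f(x) = -sum(abs2, x) / 2
x = randn(10)
cfg = ForwardDiff.GradientConfig(f, x)  # preallocates the dual-number work buffers
ForwardDiff.gradient(f, x, cfg)         # reuses (and mutates) cfg's buffers on every call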

@devmotion
Collaborator

I see. Then we should probably add a note in LogDensityProblemsAD as well that it is not thread-safe.

@devmotion
Collaborator

Since it is much more efficient to preallocate a config, as stated in the ForwardDiff docs

ForwardDiff's basic API methods will allocate these types automatically by default, but you can drastically reduce memory usage if you preallocate them yourself.

I think generally one wants to create and use the config object, also within LogDensityProblemsAD.
To me, the lack of thread safety seems to be a design decision (issue?) in ForwardDiff, so users should be aware of it; but making it inefficient for everyone, similar to the code in AbstractDifferentiation, seems undesirable to me, especially since here, in contrast to AbstractDifferentiation, we know both the function AND the size of the input.
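
As an illustration of what knowing the function and the input size buys (a sketch only; the chunk size 4 is arbitrary), the config and its chunk size can be fixed once, up front:

using ForwardDiff

f(x) = -sum(abs2, x) / 2
x = zeros(10)                                                   # dimension known ahead of time
cfg = ForwardDiff.GradientConfig(f, x, ForwardDiff.Chunk{4}())  # chunk size chosen up front
ForwardDiff.gradient(f, x, cfg)                                 # reuses cfg on every call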

@tpapp
Owner

tpapp commented Jan 31, 2023

I can think of the following (not mutually exclusive) solutions:

  1. warn users about this, suggesting that they create an AD'd object for each thread, but that only covers simple use cases like MCMC (however, that may just work for 99% of our users),
  2. provide an option w/o preallocated buffers, slower but safe

Or perhaps allocate a buffer for each thread, and have the call write to it based on Threads.threadid(), but I have to think through the invariants.

@devmotion
Collaborator

devmotion commented Jan 31, 2023

Or perhaps allocate a buffer for each thread, and have the call write to it based on Threads.threadid(), but I have to think through the invariants.

That is not guaranteed to work in newer Julia versions; generally it is not safe to use Threads.threadid() in such cases. See e.g. https://juliafolds.github.io/FLoops.jl/stable/explanation/faq/#faq-state-threadid and the discussion in https://discourse.julialang.org/t/behavior-of-threads-threads-for-loop/76042?u=devmotion.
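
As a hedged sketch of the task-local pattern recommended there (not package code; it reuses the variables from the MWE above): partition the work and give each spawned task its own ℓ, so buffers belong to a task rather than to a thread id.

chunks = Iterators.partition(1:n, cld(n, Threads.nthreads()))
tasks = map(chunks) do js
    Threads.@spawn begin
        ℓ_local = ADgradient(:ForwardDiff, P)  # task-local config, never shared
        for j in js
            results_threaded[:, j] = LogDensityProblems.logdensity_and_gradient(ℓ_local, x[:, j])[2]
        end
    end
end
foreach(wait, tasks)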

@devmotion
Collaborator

provide an option w/o preallocated buffers, slower but safe

Maybe the easiest option here would be to support passing gradientconfig = nothing to ADgradient, and handling it by constructing a new config object (so that the chunk size can still be set) when calling logdensity_and_gradient.
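
From the user's side that might look like this (purely hypothetical; this keyword does not exist at the time of writing):

ℓ_safe = ADgradient(:ForwardDiff, P; gradientconfig = nothing)  # no shared buffers; a fresh config per call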

@tpapp
Owner

tpapp commented Feb 12, 2023

I would suggest the following solution:

  1. make the default thread-safe, if potentially inefficient, for all backends, and document this,
  2. offer the faster solution in ADgradient for the relevant backends, explicitly mentioning that it is not thread-safe,
  3. implement a method for Base.copy that creates independently usable copies where applicable (e.g. gradients created with the ForwardDiff backend), and is otherwise a no-op.

This would allow users to take care of threading as they wish: most MCMC applications would just create as many copies as they need and sample in threads.
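
A sketch of how users might consume such a copy method (hypothetical until it is implemented), with the gradient call standing in for running a chain:

ℓ = ADgradient(:ForwardDiff, P)
tasks = map(1:Threads.nthreads()) do _
    Threads.@spawn begin
        ℓ_chain = copy(ℓ)  # independent work buffers for this chain
        LogDensityProblems.logdensity_and_gradient(ℓ_chain, randn(dim))
    end
end
fetch.(tasks)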

Incidentally, I do not understand how ReverseDiff is not affected. It uses the same kind of buffers.
