Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AutoDiff #331

Closed
emilemathieu opened this issue Aug 18, 2017 · 29 comments
Closed

AutoDiff #331

emilemathieu opened this issue Aug 18, 2017 · 29 comments

Comments

@emilemathieu
Copy link
Collaborator

Hi all !

I was wondering, why did you choose to use chunk-wise forward AD, and not use http://www.juliadiff.org ?

Best,
Emile

@emilemathieu
Copy link
Collaborator Author

My bad, it seems to be indeed using ForwardDiff.

@yebai
Copy link
Member

yebai commented Aug 18, 2017

My bad, it seems to be indeed using ForwardDiff

That's right, we are using ForwardDiff.

@emilemathieu
Copy link
Collaborator Author

In the wiki it is written:
Turing.jl is currently using ForwardDiff.jl for automatic differentiation, which is believed to be the main reason of slowness of Turing.jl. This AD backend will be replaced by a reverse-mode AD implementation when Julia has a mature one.

What about ReverseDiff.jl or https://github.com/denizyuret/AutoGrad.jl @yebai ?

@emilemathieu emilemathieu reopened this Aug 28, 2017
@yebai
Copy link
Member

yebai commented Aug 29, 2017

What about ReverseDiff.jl or https://github.com/denizyuret/AutoGrad.jl

Yes we've looked into this library. I think @xukai92 also did some benchmarking on its performance. The result does suggest backward AD is more efficient than forward AD on big models. Unfortunately this library is currently not compatible with Distributions.jl, and lacks support for many common numerical functions.

@emilemathieu
Copy link
Collaborator Author

Thanks for your answer. And what about ReverseDiff.jl ?

@ChrisRackauckas
Copy link
Collaborator

ReverseDiff.jl also has a very hard upper bound on ForwardDiff v0.5 which will "break" (or at least pull back) many libraries like Optim.jl, DifferentialEquations.jl, NLsolve.jl, ... i.e. I wouldn't even suggest having it installed anymore since it will cause a ton of problems.

AutoGrad has some issues with broadcast-composed functions since it uses dispatch overloading. So I think it's still stuck on v0.5.

The true answer will be Cassette.jl when it comes out. I suggest just sticking to ForwardDiff.jl until it does because the current solutions are somewhat fragile.

@emilemathieu
Copy link
Collaborator Author

In our case, we want to compute the gradient of the joint probability which is a function from the cartesian product of the model random variables' support to R, therefore R^n -> R with n=sum{n_i} (n_i being the dimensionality of the ith rv). Reverse differentiation should be more efficient than forward differentiation from a theoretical point of view, isn't it ?

@ChrisRackauckas Do you know whether Jarrett will rewrite ReverseDiff.jl using Cassette.jl, or Cassette.jl itself will allow reverse differentiation ? Would you also have an idea of when Cassette.jl is planned to be in beta ?

@emilemathieu emilemathieu reopened this Nov 9, 2017
@ChrisRackauckas
Copy link
Collaborator

Cassette.jl is just a graph construction algorithm. ReverseDiff.jl will be rewritten to use Cassette.jl. I don't know of any timelines, other than "soon enough".

@yebai
Copy link
Member

yebai commented Jan 18, 2018

Hi @ChrisRackauckas, We want to make some plans for backward autodiff integration. There are also some updates from both the Cassette and AutoGrad packages side. Most notably,

  • it seems that AutoGrad now works for Julia 0.6
  • Cassette won’t directly support autodiff anymore (judging from the repo’s readme file)

What are your current thoughts/suggestions on this issue? Are there any hard reasons why we can’t merge the AutoGrad package and ReverseDiff package since they overlap so much? Thanks!

@xukai92
Copy link
Member

xukai92 commented Jan 18, 2018

@yebai I was looking into adapting AutoGrad, with some discussions with their developer here: denizyuret/AutoGrad.jl#42

@xukai92
Copy link
Member

xukai92 commented Jan 18, 2018

In the end they say:

I understand. Rec <: Real in the general case may break other use cases.
However you could create a fork/branch of AutoGrad.jl with Rec <: Real as a
short term solution. @jrevels is working on a Cassette based solution
where a Rec type will not be necessary and this problem will be solved.

However I haven't managed to make my proposal of subtyping Rec work yet..

@ChrisRackauckas
Copy link
Collaborator

Cassette is a contextual dispatch system which can make autodifferentiation. Capstan.jl is where that implementation will live.

Are there any hard reasons why we can’t merge the AutoGrad package and ReverseDiff package since they overlap so much? Thanks!

Autograd requires registered primatives. ReverseDiff using the dispatch mechanism to push derivative information back. They are conceptually different.

Given that ReverseDiff.jl won't be going away since the Cassette work is now going to stay in a separate repo, and that ReverseDiff.jl now no longer has any hard upper bounds, I would suggest trying to get ReverseDiff.jl working. It'll go legacy at some point but should continue to work.

@xukai92
Copy link
Member

xukai92 commented Jan 18, 2018

Thanks for the answer!

I'm a bit confused about your points on the difference between AutoGrad.jl and ReverseDiff.jl. Do you mean ReverseDiff.jl doesn't need any primitives at all?

Another two question on ReverseDiff.jl

  • Is it compatible with Distributions.jl
  • Is it GPU supported?

@ChrisRackauckas
Copy link
Collaborator

Is it GPU supported?

In theory it supports any AbstractArray. So yes, in theory. In practice, it's not well-tested.

Is it compatible with Distributions.jl

It should be, since it would get the derivatives forward there via ForwardDiff.

I'm a bit confused about your points on the difference between AutoGrad.jl and ReverseDiff.jl. Do you mean ReverseDiff.jl doesn't need any primitives at all?

It has a set of generic primatives via dual numbers of ForwardDiff.jl, and uses that to compose functions together. AutoGrad is much more strict on the functional form it allows and requires registering a lot of higher level functions.

@yebai
Copy link
Member

yebai commented Jan 18, 2018

Very helpful - thank all for the answers!

@ChrisRackauckas what’s the current plan for Capstan, e.g. is there any estimated (even informally) release time? I’ve heard that Julia 1.0 is coming out soon from one Julia creator. Could we help to speed up the development of Capstan?

@ChrisRackauckas
Copy link
Collaborator

It already works, but it's slow as mentioned here JuliaLabs/Cassette.jl#5 due to JuliaLang/julia#5402 . That's planned to be fixed in a Julia v1.x, so if you don't want to wait then just setup ReverseDiff.jl for now.

@jrevels
Copy link

jrevels commented Jan 18, 2018

Just saw this thread; my comment here might be relevant.

Just FYI, I don't really plan on doing much ReverseDiff maintenance in the future, so that I can focus on Cassette/Capstan stuff. Once Capstan is released (or even just in beta), I'd be happy to help Turing.jl put it through its paces.

@xukai92
Copy link
Member

xukai92 commented Jan 19, 2018

There is still a difficulty using ReverseDiff.jl with Distributions.jl - it's not possible to construct a distribution with ReverseDiff.TrackedArray.

E.g. I cannot AD through f below

f(x) = logpdf.(Normal(x, 1), 1)

This is simply because Distritbuion.jl only support constructing distributions with Real, in which case ForwardDiff works as Dual <: Real.

@yebai
Copy link
Member

yebai commented Jan 24, 2018

@xukai92 The following seems work for me. I guess the issue is that ReverseDiff.GradientConfig only accept array-type inputs.

using ReverseDiff: GradientTape, GradientConfig, gradient, gradient!, compile
using Distributions

f(x) = logpdf.(Normal(x[1], 1), 1) # x is an array
gradient(f, [1.0]) 

@xukai92
Copy link
Member

xukai92 commented Jan 24, 2018

Emmm. I see what you the trick. Then it seems that in our Turing.jl program one should write something like this

@model test() = begin
   x ~ Normal(0, 1)
   1 ~ Normal(x[1], 1)
end

Need to test it on another computer of mine later - will update when I do that.

@yebai
Copy link
Member

yebai commented Jan 24, 2018

@ChrisRackauckas @jrevels Are there any APIs for passing scalar to target function when using ReverseDiff? (see the comment from Kai: #331 (comment))

@jrevels
Copy link

jrevels commented Jan 24, 2018

Are there any APIs for passing scalar to target function when using ReverseDiff?

No, see JuliaDiff/ReverseDiff.jl#40.

You can, of course, always (un)wrap the scalar input e.g.

x = rand()
y = rand(4)
f(x::Real, y::AbstractArray) = # some implementation returning a scalar
ReverseDiff.gradient((x, y) -> f(x[1], y), (fill(x), y))

@xukai92
Copy link
Member

xukai92 commented Jan 25, 2018

@yebai The test below works on my local Turing.jl by simply changing all the types in VarInfo which were Real to Any when necessary (I pushed my local temporarily-fixed code in the branch reverse-diff)

using ReverseDiff: GradientTape, GradientConfig, gradient, gradient!, compile

using Turing

@model gdemo(x) = begin
  s ~ InverseGamma(2,3)
  m ~ Normal(0,sqrt(s[1]))
  x[1] ~ Normal(m[1], sqrt(s[1]))
  x[2] ~ Normal(m[1], sqrt(s[1]))
  return s, m
end

mf = gdemo([1.5, 2.0])
vi = mf()

f_turing(x) = begin
    print(typeof(x))
    vi.vals[1] = x
    mf(vi).logp
end

gradient(f_turing, ([1.0]))

@xukai92
Copy link
Member

xukai92 commented Feb 22, 2018

@jrevels I managed to use ReverseDiff.jl in Turing.jl now (branch=https://github.com/yebai/Turing.jl/tree/reverse-diff)

When I integrate it with Turing.jl, I experienced a huge performance change by using a local variable to accumulate log-probability (d641f42) instead of accumulating log-probability using a field of a composite-type variable. To be more specific, when I increase the dimensionality, the time used to get AD increases 100x slower than before.

The change I made is something like

type LP
    lp::Union{Float64,ReverseDiff.TrackedReal}
end

acc!(x::Union{Float64,ReverseDiff.TrackedReal}, d::Distribution, lp::LP) = lp.lp += logpdf(d, x)

lp = LP(0.0)

f3(x) = begin
    lp.lp = 0.0
    acc!(x[1],  Normal(0, 1), lp)
    for i = 2:length(x)
        acc!(x[i], Normal(x[i-1], 1), lp)
    end
    lp.lp
end

to

lp = LP(0.0)

f1(x) = begin
    _lp::Union{Float64,ReverseDiff.TrackedReal} = 0.0
    _lp += logpdf(Normal(0, 1), x[1])
    for i = 2:length(x)
        _lp += logpdf(Normal(x[i-1], 1), x[i])
    end
    lp.lp = _lp
    lp.lp
end

I didn't understand why that happened and also didn't managed to get a simplest case to show the difference. I wonder how ReverseDiff.jl is sensitive to the following situations?

  • Nested function
  • Unstable type

@ChrisRackauckas
Copy link
Collaborator

ChrisRackauckas commented Feb 22, 2018

type LP
    lp::Union{Float64,ReverseDiff.TrackedReal}
end

that's not type stable to access. Use a type parameter

type LP{T}
    lp::T
end

@xukai92
Copy link
Member

xukai92 commented Feb 22, 2018

Thanks Chirs. Is that true if it's a local variable Julia might be able to infer the type?

@ChrisRackauckas
Copy link
Collaborator

If the code is type-stable, yes. Those type markings you're putting everywhere, except the one I mentioned, are unnecessary for type inference.

@xukai92
Copy link
Member

xukai92 commented Feb 22, 2018

I see. A follow-up question is that how much performance of ReverseDiff.jl's is influenced by un-stable type? Using the example above, I only get ~3x speed-up. But what I experienced in Turing is much larger. Is it the fact that if part of the code is type unstable, all other function calls may be influenced?

@ChrisRackauckas
Copy link
Collaborator

Yes, inference problems "leak". If you don't know the type of a, then you don't know the type of c = a+b, and then ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants