
Add local tracers #56

Closed
Tracked by #57
adrhill opened this issue May 8, 2024 · 27 comments · Fixed by #59 or #65
Comments

@adrhill
Owner

adrhill commented May 8, 2024

Currently, all tracers return conservative global sparsity patterns.
Local sparsity can be computed by propagating a primal value as well.
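
A minimal sketch of what such a primal-carrying tracer could look like (type and field names are illustrative, chosen to match the GlobalTracer/LocalTracer names used later in this thread; this is not the package's actual implementation):

# Hypothetical local tracer: carries the primal value alongside the set
# of input indices the intermediate value depends on.
struct LocalTracer{T<:Real}
    primal::T
    inputs::Set{Int}
end

# Example propagation rule for +: evaluate the primal as usual and merge
# the dependency sets (the derivative of + is never zero, so nothing is dropped).
function Base.:+(a::LocalTracer, b::LocalTracer)
    return LocalTracer(a.primal + b.primal, union(a.inputs, b.inputs))
end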

@gdalle
Collaborator

gdalle commented May 9, 2024

See #57

@adrhill
Owner Author

adrhill commented May 14, 2024

Reopening since #59 didn't add local tracers.

@adrhill adrhill reopened this May 14, 2024
@gdalle
Collaborator

gdalle commented May 15, 2024

Operators with local behavior:

@gdalle
Collaborator

gdalle commented May 15, 2024

Once we have primal values, we will also need to add new operators, mostly those that work with Bool:

  • isequal, isless, isfinite, isinf, isnan
  • possibly their friends ==, <, <= and co (check the fallback structure for those again)
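
A hedged sketch of how these Bool-returning operators could be handled once a primal value is available (assuming the LocalTracer sketch above; the real overloads would need to cover the whole list):

# With a primal value at hand, comparisons can return the primal Bool directly;
# the Bool result itself carries no derivative information to propagate.
Base.isequal(a::LocalTracer, b::LocalTracer) = isequal(a.primal, b.primal)
Base.isless(a::LocalTracer, b::LocalTracer) = isless(a.primal, b.primal)
Base.isfinite(a::LocalTracer) = isfinite(a.primal)
Base.isnan(a::LocalTracer) = isnan(a.primal)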

@gdalle
Collaborator

gdalle commented May 15, 2024

In a potential NNLib extension, we could add global and local classification of activation functions. Here are the ones with local sparsity:

  • hardsigmoid
  • hardtanh
  • relu
  • relu6
  • softshrink
  • hardswish
  • trelu
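
As an illustration, a local classification for relu could look like the following sketch (the helper name nonzero_der1 is hypothetical, in the spirit of the nonzero_der1_arg* helpers proposed further down):

using NNlib: relu

# relu'(x) is zero for x < 0 and one for x > 0; at the kink x == 0 we
# conservatively report a nonzero derivative (see the discussion on
# nondifferentiable points below).
nonzero_der1(::typeof(relu), x) = !(x < 0)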

@gdalle
Collaborator

gdalle commented May 15, 2024

Oh and of course we need to add ifelse:

function Base.ifelse(b::Bool, xt::GlobalTracer, yt::GlobalTracer)
    # Global pattern: without a primal value we must keep both branches.
    return GlobalTracer(union(xt.inputs, yt.inputs))
end

function Base.ifelse(b::Bool, xt::LocalTracer, yt::LocalTracer)
    # Local pattern: the primal condition is known, so only the taken branch counts.
    return b ? xt : yt
end

@gdalle
Collaborator

gdalle commented May 15, 2024

Do we need to reintroduce local sparsity when we:

  • multiply by zero?
  • threshold to zero?

My answer would be yes, but only if the zero in question is a constant (not a variable which we are also tracing).

Then the question becomes: do we ever encounter non-traced numbers during tracing? It seems unlikely, since we're converting everything to tracers.

@adrhill
Owner Author

adrhill commented May 15, 2024

The current structure of our code would make it trivial to support the zero-product property and the additive identity.

And we could generally return empty tracers if the primal computation returns zero.

threshold to zero

What do you mean by this?

@gdalle
Collaborator

gdalle commented May 15, 2024

And we could generally return empty tracers if the primal computation returns zero.

I'm worried we need to be extra careful when the zeros are not fixed but variables. For instance, we don't want to return an empty tracer for

f(x) = x .* x
jacobian_sparsity(f, [0.0])

even though there is a multiplication by zero occurring in the primal computation

@adrhill
Owner Author

adrhill commented May 15, 2024

we don't want to return an empty tracer

An argument could be made that the "global" Jacobian of f is 2x and that the "local" Jacobian of f evaluated at x=0 therefore is zero.

@adrhill
Owner Author

adrhill commented May 15, 2024

In other words, the local tracer could compute

$\mathbb{1}[\nabla \gamma] = \mathbb{1}[\partial_\alpha \varphi\vert_\alpha] \cdot \mathbb{1}[\nabla \alpha] \vee \mathbb{1}[\partial_\beta \varphi\vert_\beta] \cdot \mathbb{1}[\nabla \beta]$

instead of

$\mathbb{1}[\nabla \gamma] = \mathbb{1}[\partial_\alpha \varphi] \cdot \mathbb{1}[\nabla \alpha] \vee \mathbb{1}[\partial_\beta \varphi] \cdot \mathbb{1}[\nabla \beta] \quad .$

@gdalle
Collaborator

gdalle commented May 15, 2024

This is already what the theory means, no need for the $\vert_\alpha$ in addition. What you're suggesting is something else that I can't quite put my finger on

@gdalle
Collaborator

gdalle commented May 15, 2024

What about

f(x) = x[1] * x[2]
jacobian_sparsity(f, [0.0, 1.0])

In the old world, we trace both inputs, and the output gets a dependency on both.

In the new world, we would only trace the second input? The first one gets an empty tracer from the start?

@gdalle
Collaborator

gdalle commented May 15, 2024

Basically, our global tracers $\mathbb{1}[\nabla \alpha] \neq 0$ mean "the gradient is not uniformly zero".
Our local tracers (the way we saw them until now) mean "the gradient is not uniformly zero on a neighborhood of the current point". You want to restrict them further to "the gradient is not zero at the current point".

@adrhill
Owner Author

adrhill commented May 15, 2024

Right, what I'm suggesting is to give an estimate of the sparsity of $J_f\vert_x$ that is as non-conservative as possible. Counter-intuitively, this could actually make the code run faster.

@gdalle
Collaborator

gdalle commented May 15, 2024

Right but if we give every zero value an empty tracer, that boils down to saying "when the primal value is zero, every coefficient of the gradient / Hessian that involves the corresponding variable evaluates to zero at the current point". Which I am almost certain is wrong

@gdalle
Collaborator

gdalle commented May 15, 2024

I would steer clear of "practical speedups" that are not first validated by theory

@gdalle
Collaborator

gdalle commented May 15, 2024

f(x) = exp(x[1]) * x[2]
jacobian_sparsity(f, [0.0, 1.0])

If we give an empty tracer to the first index, we get something wrong regardless of what we do next

@gdalle
Collaborator

gdalle commented May 15, 2024

Essentially, even when we have zeros in our computational graph, we can't afford to forget where they come from. Zeros are only special at the end of the graph

@adrhill
Owner Author

adrhill commented May 15, 2024

I think you're misreading my argument. You said:

we don't want to return an empty tracer for

f(x) = x .* x
jacobian_sparsity(f, [0.0])

even though there is a multiplication by zero occurring in the primal computation

My point is that in this case, as written explicitly above, $\mathbb{1}[\partial_\alpha \varphi\vert_\alpha] = \mathbb{1}[\partial_\beta \varphi\vert_\beta] = 0$, so an empty tracer can be returned.

In the case of

f(x) = exp(x[1]) * x[2]
jacobian_sparsity(f, [0.0, 1.0])

$\mathbb{1}[\partial_\alpha \varphi\vert_\alpha] \neq 0$, so according to the theory, the resulting tracer isn't empty.

@gdalle
Collaborator

gdalle commented May 15, 2024

Yeah that was more of an answer to your comment on Slack, which we now both agree is wrong. I think your general idea however is right, and it only requires being more precise with the derivatives of the operators $\varphi$. The question is how precise.

For instance, the derivative of cos is zero on all of $\mathbb{Z}\pi$, but that's pretty useless in practice because we have a low chance of hitting those primal values.
On the other hand, the derivative of abs2 at zero is zero, and that's one we're relatively likely to hit.

@gdalle
Collaborator

gdalle commented May 15, 2024

And what we really don't want to screw up is operators with more than one valid derivative, aka points of nondifferentiability. For instance, abs is not differentiable at zero, so what each autodiff backend returns is an implementation detail. It doesn't have to be 0, it doesn't even have to be in the subdifferential [-1, 1]. In that case, $\mathbb{1}[\partial \varphi]$ should only be 0 if all acceptable derivatives of the operator are zero, and one otherwise.
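
Continuing the hypothetical nonzero_der1 sketch from above, the distinction between abs2 and abs would look like this:

# abs2(x) = x^2 has derivative 2x, which genuinely vanishes at x == 0.
nonzero_der1(::typeof(abs2), x) = !iszero(x)

# abs is nondifferentiable at 0: its subdifferential [-1, 1] contains nonzero
# elements, so we must conservatively report a nonzero derivative everywhere.
nonzero_der1(::typeof(abs), x) = true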

@adrhill
Owner Author

adrhill commented May 15, 2024

How about for

f(x) = exp(x[1] * x[2])
jacobian_sparsity(f, [0.0, 1.0])

When we compute the product x[1] * x[2], even though x[1] == 0.0, we cannot ditch the tracing for either argument, otherwise the final sparsity will be wrong

I suggest we just look at f(x) = x[1] * x[2] at [0.0, 1.0], since we overload at the operator level.

The corresponding Jacobian is $J_f = [x_2, x_1]$ and therefore $J_f\vert_x = [1, 0]$.

The proposed local rule gives $\mathbb{1}[\nabla \gamma] = \mathbb{1}[\partial_\alpha \varphi\vert_\alpha] \cdot \mathbb{1}[\nabla \alpha] \vee \mathbb{1}[\partial_\beta \varphi\vert_\beta] \cdot \mathbb{1}[\nabla \beta] = \mathbb{1}[\nabla \alpha]$, so somewhat unintuitively, only the gradient information from $x_1$ is propagated, resulting in $[1, 0]$ and matching the sparsity pattern of the Jacobian above.

@gdalle
Collaborator

gdalle commented May 15, 2024

The corresponding Jacobian

You mean gradient.

And yes, this theory would be implemented by the following functions

# The partial derivative of a*b w.r.t. each argument is the other argument,
# so it is nonzero at the current point iff the other argument is nonzero.
nonzero_der1_arg1(::typeof(*), a, b) = !iszero(b)
nonzero_der1_arg2(::typeof(*), a, b) = !iszero(a)
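
For the running example at $x = (0, 1)$, this gives nonzero_der1_arg1(*, 0.0, 1.0) == true and nonzero_der1_arg2(*, 0.0, 1.0) == false, reproducing the local pattern $[1, 0]$ derived above. Combined with the LocalTracer sketch from earlier in this thread, the local propagation rule could then read (illustrative only, not the package's implementation):

# Hypothetical local propagation for a binary operator φ: only merge the
# dependency set of an argument if the corresponding first-order partial
# derivative is nonzero at the current primal values.
function local_binary_rule(φ, a::LocalTracer, b::LocalTracer)
    inputs = Set{Int}()
    nonzero_der1_arg1(φ, a.primal, b.primal) && union!(inputs, a.inputs)
    nonzero_der1_arg2(φ, a.primal, b.primal) && union!(inputs, b.inputs)
    return LocalTracer(φ(a.primal, b.primal), inputs)
end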

@gdalle
Collaborator

gdalle commented May 15, 2024

I think we should add tests for these finer internal functions with ForwardDiff, instead of testing only the global classification of operators. They are very easy to screw up without realizing.
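
A minimal sketch of such a test, assuming the hypothetical nonzero_der1_arg* helpers above and using ForwardDiff as the reference (not the package's actual test suite):

using ForwardDiff, Test

# The local classification is allowed to be conservative (reporting a nonzero
# derivative even when it happens to vanish), but it must never report a zero
# derivative when ForwardDiff finds a nonzero one.
for a in (-1.0, 0.0, 2.0), b in (-1.0, 0.0, 3.0)
    da = ForwardDiff.derivative(x -> x * b, a)  # ∂(a*b)/∂a at (a, b)
    db = ForwardDiff.derivative(x -> a * x, b)  # ∂(a*b)/∂b at (a, b)
    iszero(da) || @test nonzero_der1_arg1(*, a, b)
    iszero(db) || @test nonzero_der1_arg2(*, a, b)
end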

@adrhill
Owner Author

adrhill commented May 15, 2024

I think we should add tests for these finer internal functions with ForwardDiff, instead of testing only the global classification of operators. They are very easy to screw up without realizing.

Opened #66 to track this.

@adrhill
Owner Author

adrhill commented May 15, 2024

Once we have primal values, we will also need to add new operators, mostly those that work with Bool:

  • isequal, isless, isfinite, isinf, isnan
  • possibly their friends ==, <, <= and co (check the fallback structure for those again)

As a friendly reminder to myself tomorrow: we should also test on @Vaibhavdixit02's matrix exponentials.
