
Add local tracers #56

Closed
Tracked by #57
adrhill opened this issue May 8, 2024 · 27 comments · Fixed by #59 or #65
Comments

@adrhill
Owner

adrhill commented May 8, 2024

Currently, all tracers return conservative global sparsity patterns.
Local sparsity can be computed by propagating a primal value as well.
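
A minimal sketch of what such a primal-carrying tracer could look like (type and field names are illustrative, chosen to match the GlobalTracer/LocalTracer names used later in this thread; this is not the package's actual implementation):

# Hypothetical local tracer: carries the primal value alongside the set
# of input indices the intermediate value depends on.
struct LocalTracer{T<:Real}
    primal::T
    inputs::Set{Int}
end

# Example propagation rule for +: evaluate the primal as usual and merge
# the dependency sets (the derivative of + is never zero, so nothing is dropped).
function Base.:+(a::LocalTracer, b::LocalTracer)
    return LocalTracer(a.primal + b.primal, union(a.inputs, b.inputs))
end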

@gdalle
Collaborator

gdalle commented May 9, 2024

See #57

@adrhill
Owner Author

adrhill commented May 14, 2024

Reopening since #59 didn't add local tracers.

@adrhill adrhill reopened this May 14, 2024
@gdalle
Collaborator

gdalle commented May 15, 2024

Operators with local behavior:

@gdalle
Collaborator

gdalle commented May 15, 2024

Once we have primal values, we will also need to add new operators, mostly those that work with Bool:

  • isequal, isless, isfinite, isinf, isnan
  • possibly their friends ==, <, <= and co (check the fallback structure for those again)
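
A hedged sketch of how these Bool-returning operators could be handled once a primal value is available (assuming the LocalTracer sketch above; the real overloads would need to cover the whole list):

# With a primal value at hand, comparisons can return the primal Bool directly;
# the Bool result itself carries no derivative information to propagate.
Base.isequal(a::LocalTracer, b::LocalTracer) = isequal(a.primal, b.primal)
Base.isless(a::LocalTracer, b::LocalTracer) = isless(a.primal, b.primal)
Base.isfinite(a::LocalTracer) = isfinite(a.primal)
Base.isnan(a::LocalTracer) = isnan(a.primal)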

@gdalle
Collaborator

gdalle commented May 15, 2024

In a potential NNLib extension, we could add global and local classification of activation functions. Here are the ones with local sparsity:

  • hardsigmoid
  • hardtanh
  • relu
  • relu6
  • softshrink
  • hardswish
  • trelu
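
As an illustration, a local classification for relu could look like the following sketch (the helper name nonzero_der1 is hypothetical, in the spirit of the nonzero_der1_arg* helpers proposed further down):

using NNlib: relu

# relu'(x) is zero for x < 0 and one for x > 0; at the kink x == 0 we
# conservatively report a nonzero derivative (see the discussion on
# nondifferentiable points below).
nonzero_der1(::typeof(relu), x) = !(x < 0)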

@gdalle
Collaborator

gdalle commented May 15, 2024

Oh and of course we need to add ifelse:

function Base.ifelse(b::Bool, xt::GlobalTracer, yt::GlobalTracer)
    # Global pattern: without a primal value we must keep both branches.
    return GlobalTracer(union(xt.inputs, yt.inputs))
end

function Base.ifelse(b::Bool, xt::LocalTracer, yt::LocalTracer)
    # Local pattern: the primal condition is known, so only the taken branch counts.
    return b ? xt : yt
end

@gdalle
Collaborator

gdalle commented May 15, 2024

Do we need to reintroduce local sparsity when we:

  • multiply by zero?
  • threshold to zero?

My answer would be yes, but only if the zero in question is a constant (not a variable which we are also tracing).

Then the question becomes: do we ever encounter non-traced numbers during tracing? It seems unlikely, since we're converting everything to tracers.

@adrhill
Owner Author

adrhill commented May 15, 2024

The current structure of our code would make it trivial to support the zero-product property and the additive identity.

And we could generally return empty tracers if the primal computation returns zero.

threshold to zero

What do you mean by this?

@gdalle
Collaborator

gdalle commented May 15, 2024

And we could generally return empty tracers if the primal computation returns zero.

I'm worried we need to be extra careful when the zeros are not fixed but variables. For instance, we don't want to return an empty tracer for

f(x) = x .* x
jacobian_sparsity(f, [0.0])

even though there is a multiplication by zero occurring in the primal computation

@adrhill
Owner Author

adrhill commented May 15, 2024

we don't want to return an empty tracer

An argument could be made that the "global" Jacobian of f is 2x and that the "local" Jacobian of f evaluated at x=0 therefore is zero.

@adrhill
Owner Author

adrhill commented May 15, 2024

In other words, the local tracer could compute

$\mathbb{1}[\nabla \gamma] = \mathbb{1}[\partial_\alpha \varphi\vert_\alpha] \cdot \mathbb{1}[\nabla \alpha] \vee \mathbb{1}[\partial_\beta \varphi\vert_\beta] \cdot \mathbb{1}[\nabla \beta]$

instead of

$\mathbb{1}[\nabla \gamma] = \mathbb{1}[\partial_\alpha \varphi] \cdot \mathbb{1}[\nabla \alpha] \vee \mathbb{1}[\partial_\beta \varphi] \cdot \mathbb{1}[\nabla \beta] \quad .$

@gdalle
Collaborator

gdalle commented May 15, 2024

This is already what the theory means, no need for the $\vert_\alpha$ in addition. What you're suggesting is something else that I can't quite put my finger on

@gdalle
Collaborator

gdalle commented May 15, 2024

What about

f(x) = x[1] * x[2]
jacobian_sparsity(f, [0.0, 1.0])

In the old world, we trace both inputs, and the output gets a dependency on both.

In the new world, we would only trace the second input? The first one gets an empty tracer from the start?

@gdalle
Collaborator

gdalle commented May 15, 2024

Basically, our global tracers $\mathbb{1}[\nabla \alpha] \neq 0$ mean "the gradient is not uniformly zero".
Our local tracers (the way we saw them until now) mean "the gradient is not uniformly zero on a neighborhood of the current point". You want to restrict them further to "the gradient is not zero at the current point".

@adrhill
Owner Author

adrhill commented May 15, 2024

Right, what I'm suggesting is to give an estimate of the sparsity of $J_f\vert_x$ that is as non-conservative as possible. Counter-intuitively, this could actually make the code run faster.

@gdalle
Collaborator

gdalle commented May 15, 2024

Right but if we give every zero value an empty tracer, that boils down to saying "when the primal value is zero, every coefficient of the gradient / Hessian that involves the corresponding variable evaluates to zero at the current point". Which I am almost certain is wrong

@gdalle
Collaborator

gdalle commented May 15, 2024

I would steer clear of "practical speedups" that are not first validated by theory

@gdalle
Collaborator

gdalle commented May 15, 2024

f(x) = exp(x[1]) * x[2]
jacobian_sparsity(f, [0.0, 1.0])

If we give an empty tracer to the first index, we get something wrong regardless of what we do next

@gdalle
Collaborator

gdalle commented May 15, 2024

Essentially, even when we have zeros in our computational graph, we can't afford to forget where they come from. Zeros are only special at the end of the graph

@adrhill
Owner Author

adrhill commented May 15, 2024

I think you're misreading my argument. You said:

we don't want to return an empty tracer for

f(x) = x .* x
jacobian_sparsity(f, [0.0])

even though there is a multiplication by zero occurring in the primal computation

My point is that in this case, as written explicitly above, $\mathbb{1}[\partial_\alpha \varphi\vert_\alpha] = \mathbb{1}[\partial_\beta \varphi\vert_\beta] = 0$, so an empty tracer can be returned.

In the case of

f(x) = exp(x[1]) * x[2]
jacobian_sparsity(f, [0.0, 1.0])

$\mathbb{1}[\partial_\alpha \varphi\vert_\alpha] \neq 0$, so according to the theory, the resulting tracer isn't empty.

@gdalle
Collaborator

gdalle commented May 15, 2024

Yeah that was more of an answer to your comment on Slack, which we now both agree is wrong. I think your general idea however is right, and it only requires being more precise with the derivatives of the operators $\varphi$. The question is how precise.

For instance, the derivative of cos is zero on all of $\mathbb{Z}\pi$, but that's pretty useless in practice because we have a low chance of hitting those primal values.
On the other hand, the derivative of abs2 at zero is zero, and that's one we're relatively likely to hit.

@gdalle
Collaborator

gdalle commented May 15, 2024

And what we really don't want to screw up is operators with more than one valid derivative, aka points of nondifferentiability. For instance, abs is not differentiable at zero, so what each autodiff backend returns is an implementation detail. It doesn't have to be 0, it doesn't even have to be in the subdifferential [-1, 1]. In that case, $\mathbb{1}[\partial \varphi]$ should only be 0 if all acceptable derivatives of the operator are zero, and one otherwise.
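
Continuing the hypothetical nonzero_der1 sketch from above, the distinction between abs2 and abs would look like this:

# abs2(x) = x^2 has derivative 2x, which genuinely vanishes at x == 0.
nonzero_der1(::typeof(abs2), x) = !iszero(x)

# abs is nondifferentiable at 0: its subdifferential [-1, 1] contains nonzero
# elements, so we must conservatively report a nonzero derivative everywhere.
nonzero_der1(::typeof(abs), x) = true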

@adrhill
Owner Author

adrhill commented May 15, 2024

How about for

f(x) = exp(x[1] * x[2])
jacobian_sparsity(f, [0.0, 1.0])

When we compute the product x[1] * x[2], even though x[1] == 0.0, we cannot ditch the tracing for either argument, otherwise the final sparsity will be wrong

I suggest we just look at f(x) = x[1] * x[2] at [0.0, 1.0], since we overload at the operator level.

The corresponding Jacobian is $J_f = [x_2, x_1]$ and therefore $J_f\vert_x = [1, 0]$.

The proposed local rule gives $\mathbb{1}[\nabla \gamma] = \mathbb{1}[\partial_\alpha \varphi\vert_\alpha] \cdot \mathbb{1}[\nabla \alpha] \vee \mathbb{1}[\partial_\beta \varphi\vert_\beta] \cdot \mathbb{1}[\nabla \beta] = \mathbb{1}[\nabla \alpha]$, so somewhat unintuitively, only the gradient information from $x_1$ is propagated, resulting in $[1, 0]$ and matching the sparsity pattern of the Jacobian above.

@gdalle
Collaborator

gdalle commented May 15, 2024

The corresponding Jacobian

You mean gradient.

And yes, this theory would be implemented by the following functions

# The partial derivative of a*b w.r.t. each argument is the other argument,
# so it is nonzero at the current point iff the other argument is nonzero.
nonzero_der1_arg1(::typeof(*), a, b) = !iszero(b)
nonzero_der1_arg2(::typeof(*), a, b) = !iszero(a)
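
For the running example at $x = (0, 1)$, this gives nonzero_der1_arg1(*, 0.0, 1.0) == true and nonzero_der1_arg2(*, 0.0, 1.0) == false, reproducing the local pattern $[1, 0]$ derived above. Combined with the LocalTracer sketch from earlier in this thread, the local propagation rule could then read (illustrative only, not the package's implementation):

# Hypothetical local propagation for a binary operator φ: only merge the
# dependency set of an argument if the corresponding first-order partial
# derivative is nonzero at the current primal values.
function local_binary_rule(φ, a::LocalTracer, b::LocalTracer)
    inputs = Set{Int}()
    nonzero_der1_arg1(φ, a.primal, b.primal) && union!(inputs, a.inputs)
    nonzero_der1_arg2(φ, a.primal, b.primal) && union!(inputs, b.inputs)
    return LocalTracer(φ(a.primal, b.primal), inputs)
end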

@gdalle
Collaborator

gdalle commented May 15, 2024

I think we should add tests for these finer internal functions with ForwardDiff, instead of testing only the global classification of operators. They are very easy to screw up without realizing.
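
A minimal sketch of such a test, assuming the hypothetical nonzero_der1_arg* helpers above and using ForwardDiff as the reference (not the package's actual test suite):

using ForwardDiff, Test

# The local classification is allowed to be conservative (reporting a nonzero
# derivative even when it happens to vanish), but it must never report a zero
# derivative when ForwardDiff finds a nonzero one.
for a in (-1.0, 0.0, 2.0), b in (-1.0, 0.0, 3.0)
    da = ForwardDiff.derivative(x -> x * b, a)  # ∂(a*b)/∂a at (a, b)
    db = ForwardDiff.derivative(x -> a * x, b)  # ∂(a*b)/∂b at (a, b)
    iszero(da) || @test nonzero_der1_arg1(*, a, b)
    iszero(db) || @test nonzero_der1_arg2(*, a, b)
end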

@adrhill
Owner Author

adrhill commented May 15, 2024

I think we should add tests for these finer internal functions with ForwardDiff, instead of testing only the global classification of operators. They are very easy to screw up without realizing.

Opened #66 to track this.

@adrhill
Owner Author

adrhill commented May 15, 2024

Once we have primal values, we will also need to add new operators, mostly those that work with Bool:

  • isequal, isless, isfinite, isinf, isnan
  • possibly their friends ==, <, <= and co (check the fallback structure for those again)

As a friendly reminder to myself tomorrow: we should also test on @Vaibhavdixit02's matrix exponentials.
