
Lazy ReLU operation #1176

Open
schillic opened this issue Feb 28, 2019 · 25 comments

@schillic
Member

schillic commented Feb 28, 2019

Given a set X, the set ReLU(X) represents the set {max.(x, 0) | x ∈ X}.
For a convex set X, ReLU(X) is not necessarily convex (take X as the 2D line segment from (0, -1) to (2, 1)).

For a start, we only need the support vector. The first idea is to just reset all negative entries in the support vector of X to zero. We would need to check if that is actually correct.

@mforets
Member

mforets commented Mar 1, 2019

CC: @frehseg, @tomerarnon

@schillic
Member Author

This is not correct. Take the line segment below.

[figure: the line segment]

The ReLU set is the following.

[figure: the ReLU of the line segment]

The support vector in direction [1, 1] would be the point to the right.

@tomerarnon
Contributor

I'm not sure I understand the issue you are pointing out here.

The support vector in direction [1, 1] would be the point to the right.

I.e. [3, 0] is the correct support vector in that direction? In that case, it seems it is correctly calculated with ReLU(σ([1,1], LS))

@schillic
Member Author

The support vector σ([1,1], LS) is not unique. In fact, any point on the original line segment is correct. So in this example it would be pure luck to get a correct result.
Even worse, by slightly moving the upper end to the right, σ([1,1], LS_skewed) would be unique (the top point), so then the result would always be wrong.

@tomerarnon
Contributor

Ah, now I see. I lack the intuition around support vectors to see these things easily 😅

I imagine this has to do with the non-convexity of the ReLU?

@tomerarnon
Contributor

Actually, my guess here was also not correct. Maybe the non-convexity had something to do with it in that case, but it isn't the only issue. For example, if you take a Ball2 (one that has a negative part but remains convex after ReLU-ing), the resulting support vectors are incorrect. E.g., for a ball lying slightly above the x-axis, relu(σ(p, ball)) will give a point on the x-axis in the direction p, not the correct extreme point of the rectified set as it should.
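A concrete instance of this, as a sketch (the ball, the direction p, and the numbers are my own choice, just to illustrate the discrepancy):

using LazySets, LinearAlgebra

B = Ball2([0.0, 0.1], 1.0)   # a disk centered slightly above the x-axis, dipping below it
p = [1.0, -1.0]
v = max.(σ(p, B), 0.0)       # relu applied to the support vector of B
dot(p, v)                    # ≈ 0.71
# but the maximum of x - max(y, 0) over B is ≈ 0.995 (attained at the point of B with
# y = 0 and largest x), so v is not a support vector of ReLU(B) in direction p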

@schillic
Member Author

schillic commented May 8, 2019

Here are some new ideas/observations.

  1. ReLU is conceptually simple for boxes: the result is again a box. Let Bi be the input box. We construct the output box Bo as follows (see also the code sketch after this list):

    1. Check in each dimension whether the corresponding lower bound of Bi is negative (one support-function query per dimension).
    2. For each negative case, bound Bo from below by 0. For each non-negative case, bound Bo from below by the lower bound of Bi.
    3. For each negative dimension, check whether the corresponding upper bound of Bi is positive.
    4. For each non-positive case, bound Bo from above by 0; for each positive case, keep the upper bound of Bi.

    In short: Bo = Hyperrectangle(low=max.(low(Bi), zeros(n)), high=max.(high(Bi), zeros(n)))

  2. box_approximation(ReLU(X)) == ReLU(box_approximation(X)).

  3. In the positive orthant the box approximation is very coarse. A better approximation can be obtained by taking the convex hull of all points on the positive orthant, the origin, and every intersection of the box approximation with the axes. I suggest that we cache these intersection points for efficiency. (Note that these intersection points may not exist, which is a desirable case because then the approximation is exact.)
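A minimal sketch of observation 1 (the name relu_box is mine; low, high, and the Hyperrectangle keyword constructor are existing LazySets API):

using LazySets

# ReLU of a hyperrectangle: clamp both bound vectors at zero, elementwise
relu_box(Bi::Hyperrectangle) =
    Hyperrectangle(low=max.(low(Bi), 0.0), high=max.(high(Bi), 0.0))

Bi = Hyperrectangle(low=[-1.5, -0.5], high=[1.5, 1.0])
Bo = relu_box(Bi)   # box with low = [0.0, 0.0] and high = [1.5, 1.0]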

Below is an example for observation 3.

  • light blue: polygon X (to be ReLU'ed)
  • green: box approximation of X
  • red: true ReLU(X) set (union of three sets where the set Q1 is actually a subset of Q3)
  • yellow: box approximation of ReLU(X) (= ReLU(box(X)))
  • cyan: best convex approximation of ReLU(X)

[figure: plot of the sets listed above]

julia> using LazySets, Plots
julia> using LazySets.Approximations: box_approximation
julia> P = VPolygon([[-1., 1.], [-1.5, 0.5], [1.5, 0.5], [1., -0.5]]);
julia> A1 = LineSegment([0., 0.], [2., 0.]);
julia> A2 = LineSegment([0., 0.], [0., 1.3]);
julia> Q1 = LineSegment([0., 0.], [1.25, 0.]);
julia> Q2 = LineSegment([0., 0.], [0., 1.]);
julia> Q3 = convert(HPolygon, intersection(P, Hyperrectangle(low=[0., 0.], high=[2., 2.])));
julia> Bi = box_approximation(P)
julia> Bo = box_approximation(ConvexHullArray([Q1, Q2, Q3]))
julia> C = ConvexHullArray([Q1, Q2, Q3])

julia> plot([A1, A2], linecolor=:black, markercolor=:black)
julia> plot!(Bi, color=:green)
julia> plot!(Bo, color=:yellow)
julia> plot!(C, color=:cyan)
julia> plot!(P, color=:lightblue)
julia> plot!(Q3, color=:red)
julia> plot!([Q1, Q2], linecolor=:red, markercolor=:red)

@frehseg

frehseg commented May 8, 2019 via email

@tomerarnon
Contributor

tomerarnon commented May 11, 2019

Bo = Hyperrectangle(max.(low(Bi), zeros(n)), min.(high(Bi), zeros(n)))

Shouldn't the min in the second term also be a max? I.e. I think ReLU on a hyperrectangle should be the following (maybe they are equivalent since I am using the keyword constructor here instead of the default, but I was not seeing an obvious relation)

Bo = Hyperrectangle(low = relu.(low(Bi)), high = relu.(high(Bi)))
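For reference, a minimal definition of such a relu helper (not part of LazySets; just elementwise clamping at zero when broadcast):

relu(x::Real) = max(x, zero(x))   # so relu.(low(Bi)) clamps each lower bound at 0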

Potential correction aside, this is a great observation! We do something similar, but much messier, in our implementation of the MaxSens algorithm (I might steal this actually to clean up that code 😄).

Ultimately though, I'd say it is the cyan set that is the "holy grail" for relu. Or an even tighter approximation that doesn't project the set downwards onto the x-axis at all (since it isn't necessary in that region of the graph).

I see why this is a very tough problem though; I'm really glad to see that you are still working on it! Maybe it has been obvious, but I've had to redirect my attention elsewhere recently. I am still hoping to return to this towards the end of the academic quarter, but I clearly haven't been able to devote enough time to it...

@schillic
Member Author

schillic commented May 11, 2019

Shouldn't the min in the second term also be a max?

True, thanks! And I like the definition of a helper function relu 👍

I am using the keyword constructor here

Oh, I also wanted to use that. Fixed!

Or an even tighter approximation that doesn't project the set downwards onto the x-axis at all (since it isn't necessary in that region of the graph).

Yes, true, another error in my comment. I fixed it. And I also made the code runnable.

I clearly haven't been able to devote enough time to it

Don't worry, the same goes for me 😊

@schillic
Member Author

schillic commented May 14, 2019

To generalize the observations above:

  1. ReLU distributes over the Cartesian product: ReLU(X × Y) = ReLU(X) × ReLU(Y)
  2. ReLU is easy for intervals: ReLU([a, b]) = [relu(a), relu(b)]

This explains the good behavior for boxes, since a box is a Cartesian product of intervals (a small code sketch follows below).

  3. ReLU distributes over the union: ReLU(X ∪ Y) = ReLU(X) ∪ ReLU(Y)
  4. ReLU is easy to compute for (general) convex sets that are either nonnegative or nonpositive in each dimension (i.e., subsets of some orthant). In the nonnegative orthant, ReLU(X) = X. In every other orthant, ReLU(X) is obtained by projecting each nonpositive dimension to zero. (This projection can be represented lazily.) Each case results in a convex set.

This leads to an algorithm to compute ReLU(Z) for a set Z by subdividing Z into the orthants. (This subdivision can be implemented by an intersection with half-spaces.)
For boxes we obtain the algorithm mentioned before after some further simplifications, and coincidentally the union is again convex.
In general, the union can easily be overapproximated by its convex hull, which, as is well known, is the smallest convex set containing ReLU(Z).
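A small sketch of observations 1 and 2 in LazySets terms (the helper names relu_interval and relu_box_via_product are mine, not package API):

using LazySets

# observation 2: ReLU of an interval [a, b] is [max(a, 0), max(b, 0)]
relu_interval(X::Interval) = Interval(max(low(X)[1], 0.0), max(high(X)[1], 0.0))

# observation 1: ReLU distributes over Cartesian products, so a box, being a
# product of intervals, can be rectified factor by factor
relu_box_via_product(X::Interval, Y::Interval) =
    CartesianProduct(relu_interval(X), relu_interval(Y))

B = relu_box_via_product(Interval(-1.5, 1.5), Interval(-0.5, 1.0))
# B is the (lazy) product of [0, 1.5] and [0, 1.0]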

Another observation:

  1. Let P = CHull(vertices(P)) be a polytope given in vertex representation.
    Then CHull(ReLU(P)) = CHull({relu(x) | x ∈ vertices(P)}).
    In particular, vertices(CHull(ReLU(P))) ⊆ {relu(x) | x ∈ vertices(P)} ⊆ CHull(ReLU(P)).

This gives a simple algorithm to compute the vertex representation of the convex hull of the ReLU of a polytope.
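A hedged sketch of that algorithm for a 2D polytope in vertex representation (relu and relu_hull are my names; vertices_list and VPolygon are LazySets API, and I rely on the VPolygon constructor removing redundant vertices):

using LazySets

relu(x) = max.(x, 0.0)   # elementwise ReLU of a vertex

# convex hull of ReLU(P): rectify every vertex and take the polygon they span
relu_hull(P::VPolygon) = VPolygon([relu(v) for v in vertices_list(P)])

P = VPolygon([[-1., 1.], [-1.5, 0.5], [1.5, 0.5], [1., -0.5]])
C = relu_hull(P)   # per the observation above, C = CHull(ReLU(P))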

@schillic
Member Author

Here is a (manual) prototype implementation. The imprecision in the plots is due to the overapproximation used for plotting.

julia> using LazySets, LazySets.Approximations, Plots, Optim

julia> xlim = [-1.5, 1.5]; ylim = [-0.5, 1.0];


# original set
julia> P = VPolygon([[-1., 1.], [-1.5, 0.5], [1.5, 0.5], [1., -0.5]]);

julia> plot(P, xlim=xlim, ylim=ylim)

[figure ReLU1: the original set P]

# subdivision into orthants
julia> Q1 = P ∩ HPolygon([HalfSpace([-1., 0.], 0.), HalfSpace([0., -1.], 0.)]);

julia> Q2 = P ∩ HPolygon([HalfSpace([-1., 0.], 0.), HalfSpace([0., 1.], 0.)]);

julia> Q3 = P ∩ HPolygon([HalfSpace([1., 0.], 0.), HalfSpace([0., 1.], 0.)]);

julia> Q4 = P ∩ HPolygon([HalfSpace([1., 0.], 0.), HalfSpace([0., -1.], 0.)]);

julia> plot!(Q1, Nφ=60)

julia> plot!(Q2, Nφ=60)

julia> plot!(Q3, Nφ=60)

julia> plot!(Q4, Nφ=60)

[figure ReLU2: P subdivided into the four orthants]

# projection
julia> R2 = [1. 0.; 0. 0.] * Q2;

julia> R3 = [0. 0.; 0. 0.] * Q3;

julia> R4 = [0. 0.; 0. 1.] * Q4;

# identify redundant sets (not needed, but simplifies things later)
julia> [R2 ⊆ P, R3 ⊆ P, R4 ⊆ P]
3-element Array{Bool,1}:
  true
  true
 false

julia> plot(P, xlim=xlim, ylim=ylim)

julia> plot!(Q1, Nφ=60, color=:red)

julia> plot!(R4, Inf, linecolor=:red, width=5)

[figure ReLU3: P with Q1 and the projection R4 highlighted]

# union (not convex)
julia> U = UnionSetArray([Q1, R4]);

# convex hull
julia> UC = ConvexHullArray([Q1, R4]);

# need to overapproximate for plotting lazy convex hull of lazy intersections
julia> Upolar = overapproximate(UC, PolarDirections(20));

julia> plot(P, xlim=xlim, ylim=ylim)

julia> plot!(Upolar)

[figure ReLU4: overapproximation of the convex hull of the union]

I was actually surprised that the inclusion checks worked out. In general I would expect precision problems. The simplest sufficient condition is that the set on the l.h.s. is empty (not the case in this example, but generally this should happen), and this can be detected without an inclusion check.

Note that this "algorithm" is exponential in the number of dimensions. It will be vital to first detect dimensions where the corresponding 0-hyperplane is not crossed to cut down the number of sets.

@frehseg

frehseg commented May 17, 2019

I tried to see what the support function of ReLU looks like. Let's consider that ReLU is applied only to the n-th (last) dimension, which we denote by
ReLU(X,n)
Let d=[r,s], where s is the coefficient in the last dimension. Then I get the following:
\rho_{ReLU(X,n)}([r,s]) = \max \Bigl\{ \inf_{t\geq 0} \rho_X\bigl([r,-t]\bigr), \inf_{t'\geq 0} \rho_X\bigl([r,s+t']\bigr) \Bigr\}
This is an interesting function, because the function that's "infimized" on the left is the same as the one on the right, just mirrored and shifted. We can write it as
\rho_{ReLU(X,n)}([r,s]) = \max \Bigl\{ \inf_{t\geq 0} f(-t), \inf_{t'\geq 0} f(s+t') \Bigr\}
where f(t) is a convex function. With this knowledge, we can do some case distinctions in the domain of t and deduce that the maximum can only be attained at t=0, t=s, or for some t in [0,s].

In sum, we get that if s>=0,
\rho_{ReLU(X,n)}([r,s]) = \max \Bigl\{ \rho_X\bigl([r,0]\bigr), \rho_X\bigl([r,s]\bigr) \Bigr\} = \max_{0 \leq t \leq s} \rho_X\bigl([r,t]\bigr)
The above formula only gives an upper bound if s<0. The exact solution for s<0 is
\rho_{ReLU(X,n)}([r,s]) =  \min_{s \leq t \leq 0}  \rho_X\bigl([r,t]\bigr)

Hope this helps.
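A rough numerical check of the s ≥ 0 identity on the polygon P from an earlier comment, with ReLU applied only to the second coordinate (my own harness, not package code):

using LazySets

P = VPolygon([[-1., 1.], [-1.5, 0.5], [1.5, 0.5], [1., -0.5]])
r, s = 0.3, 1.0                                   # direction [r, s] with s ≥ 0

claimed = max(ρ([r, 0.0], P), ρ([r, s], P))

# exact value of ρ over {(x, max(y, 0)) : (x, y) ∈ P}: for s ≥ 0 the objective
# r*x + s*max(y, 0) is convex, so its maximum over P is attained at a vertex
exact = maximum(r * v[1] + s * max(v[2], 0.0) for v in vertices_list(P))

claimed ≈ exact   # true here (both are 0.95)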

@frehseg

frehseg commented May 17, 2019

As a corollary, we have that
\rho_{ReLU(X,n)}([r,s]) \leq \rho_X\bigl([r,0]\bigr) + \max \Bigl\{ 0, \rho_X\bigl([0,s]\bigr) \Bigr\}
\rho_{ReLU(X,n)}([r,s]) \leq \rho_X\bigl([r,s]\bigr) + \max \Bigl\{ 0, \rho_X\bigl([0,-s]\bigr) \Bigr\}

All of the above under the disclaimer that maybe I messed up the proof somewhere. ;-)

@frehseg

frehseg commented May 17, 2019

Fun conjecture (walking on thin ice):
\textrm{If } 0 \in X, \textrm{ then } \rho_{ReLU(X)}(d) = \rho_{X}\bigl(ReLU(d)\bigr).
with corollary
\textrm{If } 0 \in X, \textrm{ then } \rho_{A\cdot ReLU(X)+b}(d) = \rho_{X}\bigl(ReLU(A^T d)\bigr)+b^T d.

Waiting for your counterexamples to roll in...

@schillic
Member Author

As a corollary, we have that [...]

This sounds promising because it is cheap, but I am somehow missing a recursion of ρᵣₑₗᵤ(X). So you have to choose a single dimension to play with?

Waiting for your counterexamples to roll in...

Give me a few minutes.

@schillic
Member Author

schillic commented May 17, 2019

Waiting for your counterexamples to roll in...

Not a counterexample, but a pathological case: What if d = [-1, -1, ..., -1]?

Here is a real counterexample:
Take my sets P and C ≈ ReLU(P) from above and consider d = [0.1, 1.0] = ReLU([0.1, 1.0]).

julia> ρ([0.1, 1.0], P)
0.9

julia> ρ([0.1, 1.0], C)
1.0

It is not a counterexample to an overestimation, though. If your result is an overestimation (I am not sure), it may still be useful and maybe in practice that bound is good enough.

@frehseg

frehseg commented May 18, 2019

Thank you Christian for bringing me back down to earth. Upon closer inspection I only get the following:

In general, ρ_{ReLU(X)}(d) ≥ 0 if all d_i ≥ 0, and ≤ 0 otherwise.
If 0 ∈ X, then ρ_{ReLU(X)}(d) ≥ 0 if all d_i ≥ 0, and = 0 otherwise.

@schillic
Member Author

schillic commented May 18, 2019

I only get the following

Do you mean that all results so far do not hold?

The example from above is at least also a counterexample to the last result here:

In sum, we get that if s>=0 [...]

Consider direction [r, s] = [-1, 1]. Since s >= 0, the formula predicts that ρ([-1, 1], C) = max(ρ([-1, 0], P), ρ([-1, 1], P)). Let's see what we get.

julia> ρ([-1., 1.], C)
1.0

julia> ρ([-1., 0.], P)
1.5

julia> ρ([-1., 1.], P)
2.0

julia> 1.0 == max(1.5, 2.0)
false

@frehseg

frehseg commented May 18, 2019

No, only the conjecture is false. The rest should be ok. Your counterexample doesn’t apply, because you need to apply the formula nested, i.e., componentwise (everywhere where there’s an s it means that ReLU is applied to a single variable).

ρ([-1, 1], C) = max(ρ([-1, 0], ReLU_1(P)), ρ([-1, 1], ReLU_1(P))) = max(min_{t ∈ [-1,0]} ρ([t, 0], P), min_{t ∈ [-1,0]} ρ([t, 1], P)).

@schillic
Member Author

Hm, then I did not understand the formula
\rho_{ReLU(X)}([r,s]) = \max \Bigl\{ \rho_X\bigl([r,0]\bigr), \rho_X\bigl([r,s]\bigr) \Bigr\}

@frehseg

frehseg commented May 18, 2019

I changed the notation in my posts above to clarify where I just consider the ReLU applied to one variable.

The general formula for ReLU applied to all variables is as follows:
\rho_{ReLU(X)}(d) \leq \max_{I \subseteq \{1,\ldots,n\}} \rho_X\Bigl( \sum_{i \in I} d_i e_i \Bigr)
The above bound is tight if all elements of d are positive. If all elements of d are negative, then
\rho_{ReLU(X)}(d) =  \min_{d \leq t \leq 0} \rho_X (t)
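A brute-force evaluation of this bound for the 2D example from earlier in the thread (illustrative only; enumerating the subsets is of course exponential in n):

using LazySets

P = VPolygon([[-1., 1.], [-1.5, 0.5], [1.5, 0.5], [1., -0.5]])
d = [0.1, 1.0]                           # all entries positive, so the bound should be tight

subsets = [Int[], [1], [2], [1, 2]]      # all I ⊆ {1, 2}
bound = maximum(ρ([i in I ? d[i] : 0.0 for i in 1:2], P) for I in subsets)
# bound == 1.0, matching ρ([0.1, 1.0], C) == 1.0 for the rectified set C computed above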

@frehseg

frehseg commented May 18, 2019

The above bounds are related to your results with bounding boxes as follows. We can overapproximate the support function by decomposing the direction d into an arbitrary set of vectors L, as long as their sum is equal to d:
\rho_X(d) = \rho_X\Bigl( \sum_{\ell \in L} \ell \Bigr) \leq \sum_{\ell \in L} \rho_X(\ell)
(the above notation gets zero points for readability, sorry)
So we can substitute
\rho_X\Bigl( \sum_{i \in I} d_i e_i \Bigr) \leq \sum_{i \in I} |d_i| \rho_X\bigl( \mathrm{sign}(d_i) e_i \bigr)
to get a result that only uses the support function of the bounding box:
\rho_{ReLU(X)}(d) \leq \max_{I \subseteq \{1,\ldots,n\}} \sum_{i \in I} |d_i| \rho_X\bigl( \mathrm{sign}(d_i) e_i \bigr)
which we can simplify to:
\rho_{ReLU(X)}(d) \leq \sum \Bigl\{ |d_i| \rho_X\bigl( \mathrm{sign}(d_i) e_i \bigr) \Bigm| \rho_X\bigl( \mathrm{sign}(d_i) e_i \bigr) \geq 0 \Bigr\}
If I'm not mistaken, the right hand side is the support function of the ReLU applied to the bounding box of the union of X with 0.
We can rewrite the above in a fun (not necessarily useful) way:
\rho_{ReLU(X)}(d) \leq \sum_{i=1}^n ReLU\Bigl(\rho_X\bigl( d_i e_i \bigr)\Bigr) = \sum_{i=1}^n \rho_{\{0\} \cup X}\bigl( d_i e_i \bigr)

@frehseg

frehseg commented May 18, 2019

Here's an attempt at a general form. Let ReLU(X,I) denote that ReLU is applied to the dimensions in the index set I.

Let d be decomposed into its positive and negative parts,
d = d^+ - d^-, \quad d^+, d^- \geq 0,
and let I+ be the indices where d>=0 and I- where d < 0.
Then we get for any d,
\begin{array}{rcl} \rho_{ReLU(X)}(d) &=& \rho_{ReLU(ReLU(X,I^-),I^+)}(d^+ - d^-) \\ &=& \max_{0 \leq t^+ \leq d^+} \rho_{ReLU(X,I^-)}(t^+ - d^-) \\ &=& \max_{0 \leq t^+ \leq d^+} \min_{0 \leq t^- \leq d^-} \rho_X (t^+ - t^-) \\ &=& \min_{0 \leq t^- \leq d^-} \max_{0 \leq t^+ \leq d^+} \rho_X (t^+ - t^-) \end{array}

schillic added a commit that referenced this issue May 23, 2019
schillic added a commit that referenced this issue May 23, 2019
schillic added a commit that referenced this issue Aug 1, 2019
@schillic schillic removed their assignment Sep 3, 2019
@schillic schillic added discussion 🗣️ Requires human input and removed feature ➕ A new feature labels Sep 3, 2019
@tomerarnon
Contributor

tomerarnon commented Jul 28, 2020

Rekindling an old fire here! But I thought this might be the place to share this thought:

I've just taken a dive into the implementation of to_union_of_projections, which underlies the concrete Rectification, and I think there is perhaps one additional pruning step that can be done. After the mixed dimensions are computed (and determined to be non-empty), comparing the support function of X in a given mixed direction with that of the positive-orthant subset of X (which would need to be computed at the outset) can prune projections that are entirely contained within the positive part of X. I.e., something like

# bad math probably, and X ∩ O⁺ should be computed 
# only once of course, but I hope the idea is clear.
# for d ∈ mixed_dimensions

if ρ(d, X ∩ O⁺) < ρ(d, X)
    # d is a relevant direction. 
    # Should proceed to calculate.
else
    # Projection(X, d) is a subset of X ∩ O⁺, and so should be skipped.
end

I'm not sure if this poses problems with different set types, or anything like that, but it was a thought that occurred to me, so I went digging through the source to see if it was already in place (and couldn't find it, or just failed to identify it).

Most efficiently, I think this step would be incorporated into the compute_negative_and_mixed_dimensions pass at the start rather than as a separate pass, but it requires computing the intersection with the positive orthant (maybe concretely?) so I don't know what impact that would have on the algorithm.
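A hedged 2D sketch of that pruning check (the large box standing in for the positive orthant and all names here are illustrative, not the actual to_union_of_projections internals):

using LazySets

X = VPolygon([[-1., 1.], [-1.5, 0.5], [1.5, 0.5], [1., -0.5]])
Opos = Hyperrectangle(low=[0., 0.], high=[10., 10.])   # stand-in for the positive orthant
Xpos = intersection(X, Opos)                           # X ∩ O⁺, computed once

candidates = ([1.0, -1.0], [-1.0, 1.0])                # example "mixed" directions
relevant = [d for d in candidates if ρ(d, Xpos) < ρ(d, X)]
# directions not in `relevant` have their projection contained in X ∩ O⁺ and could be skipped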
