\section{Bramble-Hilbert Lemma: Scaling and the Sobolev Embedding Condition}
A Bramble-Hilbert lemma scaling argument gives
\begin{align*}
\left\|f_l\right\|_{L_p(\Omega)} \leq C 2^{l(d / q-d / p-s)}
\end{align*}
- Suppose, by rescaling, that $\|f\|_{W^s\left(L_q(\Omega)\right)} \leq 1$
- $d / q-d / p-s<0$ (the Sobolev embedding condition)
- Guarantees that the series for $f$ converges in $L_p$
Truncating the series at level $l_0$ gives error
\begin{align*}
\left\|f-\sum_{l=0}^{l_0} f_l\right\|_{L_p(\Omega)} \leq C \sum_{l=l_0+1}^{\infty} 2^{l(d / q-d / p-s)} \leq C 2^{l_0(d / q-d / p-s)}
\end{align*}
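To relate this to the number of parameters $N$ (this substitution is spelled out here for clarity; it is implicit in the slides), note that the truncated expansion uses $N \approx K 2^{l_0 d}$ coefficients, so
\begin{align*}
2^{l_0(d / q-d / p-s)}=\left(2^{l_0 d}\right)^{1 / q-1 / p-s / d} \approx N^{1 / q-1 / p-s / d} .
\end{align*}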
- The (naive) linear method therefore gets the rate $C N^{1 / q-1 / p-s / d}$
- Why are non-linear methods better?
- Write down the basis at level $l$:
\begin{align*}
\rho_{\mathbf{i}}^\alpha(x)= \begin{cases}\prod_{j=1}^d\left(2^l x_j-\mathbf{i}_j\right)^{\alpha_j} & x \in \Omega_{\mathbf{i}} \\ 0 & \text { otherwise }\end{cases}
\end{align*}
- $\alpha$ is a multi-index representing the degree
- If we represent $f_l$ with respect to this basis:
\begin{align*}
f_l=\sum_{\alpha, \mathbf{i}} a_{\alpha, \mathbf{i}} \rho_{\mathbf{i}}^\alpha(x)
\end{align*}
- Let us write $\mathbf{a}=\left(a_{\alpha, \mathbf{i}}\right) \in \mathbb{R}^{K 2^{l d}}$
- Note $K=\binom{k+d}{k}$, the dimension of the space of polynomials of degree at most $k$ in $d$ variables
- We can calculate that
\begin{align*}
\|\mathbf{a}\|_{\ell^q} \leq C 2^{l(d / q-s)}
\end{align*}
- This means that $\mathbf{a}$ is approximately sparse!
- Indeed, there is an $m$-sparse vector $\tilde{\mathbf{a}}$ such that
\begin{align*}
\|\mathbf{a}-\tilde{\mathbf{a}}\|_{\ell^p} \leq\|\mathbf{a}\|_{\ell^q} m^{1 / p-1 / q}
\end{align*}
- Finally, putting everything together and optimizing the choice of $m$ at each level gives a non-linear rate of $C N^{-s / d}$
- Key point: the (approximate) sparsity of $\mathbf{a}$ enables non-linear approximation! (See the numerical sketch below.)
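A quick numerical check of the $m$-term bound above (a sketch, not part of the original slides; the random coefficient vector is purely illustrative), keeping the $m$ largest-magnitude entries with $p=2$, $q=1$:
\begin{verbatim}
import numpy as np

# Numerical check (not from the slides) of the m-term approximation bound:
# if a_tilde keeps only the m largest-magnitude entries of a, then
#   ||a - a_tilde||_p <= ||a||_q * m^(1/p - 1/q)   for q < p.
rng = np.random.default_rng(0)
p, q, m = 2.0, 1.0, 10

# A roughly sparse coefficient vector, purely for illustration.
a = rng.standard_normal(1000) * rng.random(1000) ** 4

# Best m-term approximation: keep the m entries of largest magnitude.
a_tilde = np.zeros_like(a)
keep = np.argsort(np.abs(a))[-m:]
a_tilde[keep] = a[keep]

lhs = np.linalg.norm(a - a_tilde, ord=p)
rhs = np.linalg.norm(a, ord=q) * m ** (1.0 / p - 1.0 / q)
print(f"{lhs:.4f} <= {rhs:.4f}")   # the bound always holds
\end{verbatim}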
\section{Fundamental Lower Bound: Metric Entropy}
Definition (Kolmogorov)
Let $X$ be a Banach space and $B \subset X$. The metric entropy numbers of $B$, $\epsilon_n(B)_X$ are given by
$\epsilon_n(B)_X=\inf \left\{\epsilon: B\right.$ is covered by $2^n$ balls of radius $\left.\epsilon\right\}$.
- Roughly speaking, $\epsilon_n(B)_X$ measures how accurately elements of $B$ can be specified with $n$ bits.
- Gives a fundamental lower bound on the rates of encodable approximation
- If the compact Sobolev embedding holds, then${ }^5$
\begin{align*}
\varepsilon_n\left(B^s\left(L_q(\Omega)\right)\right)_{L^p(\Omega)} \approx n^{-s / d}
\end{align*}
Ref:
${ }^5$ M. Š. Birman and M. Z. Solomjak. "Piecewise-polynomial approximations of functions of the classes $W_p^\alpha$". In: Mathematics of the USSR-Sbornik 2.3 (1967), p. 295.
\section{Deep ReLU Networks}
- Consider an affine map $A_{\mathbf{W}, b}: \mathbb{R}^n \rightarrow \mathbb{R}^k$
\begin{align*}
A_{\mathbf{W}, b}(x)=\mathbf{W} x+b .
\end{align*}
- Let $\sigma(x)=\max (0, x)$ denote the ReLU
- When applied to a vector, $\sigma$ is applied component-wise
- A deep ReLU network with width $W$ and depth $L$ mapping $\mathbb{R}^d$ to $\mathbb{R}^k$ is a composition
\begin{align*}
A_{\mathbf{W}_L, b_L} \circ \sigma \circ A_{\mathbf{W}_{L-1}, b_{L-1}} \circ \sigma \circ \cdots \circ \sigma \circ A_{\mathbf{W}_1, b_1} \circ \sigma \circ A_{\mathbf{W}_0, b_0}
\end{align*}
- Here $A_{\mathbf{W}_1, b_1}, \ldots, A_{\mathbf{W}_{L-1}, b_{L-1}}: \mathbb{R}^W \rightarrow \mathbb{R}^W$, while $A_{\mathbf{W}_0, b_0}: \mathbb{R}^d \rightarrow \mathbb{R}^W$ and $A_{\mathbf{W}_L, b_L}: \mathbb{R}^W \rightarrow \mathbb{R}^k$
- We denote the set of such networks by $\Upsilon^{W, L}\left(\mathbb{R}^d, \mathbb{R}^k\right)$
- All functions $f \in \Upsilon^{W, L}\left(\mathbb{R}^d, \mathbb{R}^k\right)$ are continuous and piecewise linear
- The number of pieces can be exponential in the depth $L$
- Number of parameters scales like $W^2 L$
- Classical piecewise linear finite element functions can be represented ${ }^6$
\begin{figure}
\centering
\includegraphics[width=0.7\linewidth]{deeprelufemlinear}
\caption{}
\label{fig:deeprelufemlinear}
\end{figure}
Ref
${ }^6$ Juncai He et al. "ReLU Deep Neural Networks and Linear Finite Elements". In: Journal of Computational Mathematics 38.3 (2020), pp. 502–527.
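As a concrete reading of the definition above, the following NumPy sketch (an illustration, not from the slides) evaluates such a composition: $L+1$ affine maps with the ReLU applied component-wise between consecutive ones, here with randomly drawn weights:
\begin{verbatim}
import numpy as np

# Minimal NumPy sketch (illustration only) of evaluating a deep ReLU network
#   A_{W_L,b_L} o sigma o A_{W_{L-1},b_{L-1}} o ... o sigma o A_{W_0,b_0}
# with input dimension d, hidden width W, depth L, and sigma(x) = max(0, x).
def relu(x):
    return np.maximum(0.0, x)

def deep_relu_network(x, weights, biases):
    """weights[0]/biases[0] map R^d -> R^W, the intermediate pairs map
    R^W -> R^W, and the last pair maps R^W -> R^k."""
    z = weights[0] @ x + biases[0]
    for W_l, b_l in zip(weights[1:], biases[1:]):
        z = W_l @ relu(z) + b_l          # affine map after a component-wise ReLU
    return z

# Random example with d = 3, width W = 8, depth L = 5, output dimension k = 1.
rng = np.random.default_rng(0)
d, W, L, k = 3, 8, 5, 1
dims = [d] + [W] * L + [k]               # L + 1 affine maps in total
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(L + 1)]
biases = [rng.standard_normal(dims[i + 1]) for i in range(L + 1)]
print(deep_relu_network(rng.random(d), weights, biases))
\end{verbatim}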
\section{Deep ReLU Network Approximation of Sobolev Functions}
How Efficiently Can Deep ReLU Networks Approximate Sobolev Functions?
- Let $X=L_p(\Omega)$ and $B=B^s\left(L_q(\Omega)\right)$
- Let $\mathcal{F}_L=\Upsilon^{W, L}\left(\mathbb{R}^d\right)$
- Fixed but sufficiently large width $W$
- Let the depth $L \rightarrow \infty$
- Number of parameters $O(L)$
- This regime gives best rates in terms of parameter count
- What are the optimal rates of approximation by deep networks:
\begin{align*}
\sup _{f \in B^s\left(L_q(\Omega)\right)} \inf _{f_L \in \mathcal{F}_L}\left\|f-f_L\right\|_{L_p(\Omega)} ?
\end{align*}
\section{Prior Work: Superconvergence}
A recent surprising result${ }^7$:
Theorem:
Suppose that $p=q=\infty$ and $0<s \leq 1$, so that $W^s\left(L_{\infty}(\Omega)\right)$ is the Lipschitz (Hölder) class of order $s$. Then for sufficiently large $W$ (depending upon $d$)
\begin{align*}
\inf _{f_L \in \Upsilon^{W, L}\left(\mathbb{R}^d\right)}\left\|f-f_L\right\|_{L_{\infty}(\Omega)} \leq C\|f\|_{W^s\left(L_{\infty}(\Omega)\right)} L^{-2 s / d} .
\end{align*}
- This is sharp for deep ReLU networks
- Classical methods (even nonlinear) can only get a rate of convergence $N^{-s / d}$
- Metric entropy decays like $N^{-s / d}$ : Non-encodable weights!
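Spelling out the comparison in terms of parameters (an elaboration of the bullets above, using the $O(L)$ parameter count from the previous slide): with the width fixed, the number of parameters is $N \asymp W^2 L = O(L)$, so the theorem gives
\begin{align*}
L^{-2 s / d} \asymp N^{-2 s / d} \ll N^{-s / d},
\end{align*}
i.e.\ twice the exponent that classical (even non-linear) methods and the metric entropy allow for encodable approximations.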
${ }^7$ Dmitry Yarotsky. "Optimal approximation of continuous functions by very deep ReLU networks". In: arXiv preprint arXiv:1802.03620 (2018); Zuowei Shen, Haizhao Yang, and Shijun Zhang. "Optimal approximation rate of ReLU networks in terms of width and depth". In: Journal de Mathématiques Pures et Appliquées 157 (2022), pp. 101–135.
\section{Prior Work: Extension}
- Superconvergence result has been generalized ${ }^8$ to $s>1$
- Optimal approximation rates when both depth and width vary ${ }^9$
- Derivatives can also be approximated ${ }^{10}$ if $s>1$
- Suboptimal results for Sobolev spaces $W^s\left(L_q\right)$ with $q<\infty$ have been obtained${ }^{11}$
- Our interest: Do we get superconvergence when $q<p \leq \infty$ ? What is the optimal rate for all pairs $s, p, q$ ?
- All existing results only apply when $q=\infty$
${ }^8$ Jianfeng Lu et al. "Deep network approximation for smooth functions". In: SIAM Journal on Mathematical Analysis $53.5$ (2021), pp. 5465-5506.
${ }^9$ Zuowei Shen, Haizhao Yang, and Shijun Zhang. "Optimal approximation rate of ReLU networks in terms of width and depth". In: Journal de Mathématiques Pures et Appliquées 157 (2022), pp. 101-135.
${ }^{10}$ Sean Hon and Haizhao Yang. "Simultaneous neural network approximations in sobolev spaces". In: arXiv preprint arXiv:2109.00161 (2021).
${ }^{11}$ Ronald DeVore, Boris Hanin, and Guergana Petrova. "Neural Network Approximation". In: arXiv preprint arXiv:2012.14501 (2020).
\section{Reduction to Coefficient Coding}
- Consider the region
\begin{align*}
\Omega_{\mathbf{i}}^\epsilon=\prod_{j=1}^d\left[\mathbf{i}_j n^{-1},\left(\mathbf{i}_j+1\right) n^{-1}-\epsilon\right]
\end{align*}
- There exists a small (depth $O(\log (n))$) network which maps${ }^{12}$
\begin{align*}
\Omega_{\mathbf{i}}^\epsilon \rightarrow \operatorname{index}(\mathbf{i}):=\sum_{j=1}^d \mathbf{i}_j n^{j-1}
\end{align*}
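For illustration only (this is not the ReLU construction itself, which is given in the footnoted references): away from the $\epsilon$-margins the target map simply recovers the cell index $\mathbf{i}$ from $x$ and flattens it, which the following Python sketch mimics with a floor:
\begin{verbatim}
import numpy as np

# Illustration only: the piecewise-constant map that the depth-O(log n)
# network realizes on Omega_i^eps.  A point x in the cell indexed by
# i = (i_1, ..., i_d), with i_j in {0, ..., n-1}, is sent to
#   index(i) = sum_j i_j * n^(j-1).
def cell_index(x, n):
    # Recover i_j from x_j in [i_j/n, (i_j+1)/n - eps], then flatten.
    i = np.floor(n * np.asarray(x)).astype(int)
    return int(np.sum(i * n ** np.arange(len(i))))

n = 4
x = [0.30, 0.55, 0.80]    # lies in the cell i = (1, 2, 3) when n = 4
print(cell_index(x, n))   # 1*1 + 2*4 + 3*16 = 57
\end{verbatim}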
${ }^{12}$ Zuowei Shen, Haizhao Yang, and Shijun Zhang. "Optimal approximation rate of ReLU networks in terms of width and depth". In: Journal de Mathématiques Pures et Appliquées 157 (2022), pp. 101-135.
${ }^{13}$ Dmitry Yarotsky. "Error bounds for approximations with deep ReLU networks". In: Neural Networks 94 (2017), pp. 103–114.
\section{Bit Extraction}
- The key to superconvergence is the bit-extraction technique ${ }^{14}$
- Suppose that $\mathbf{x} \in\{0,1\}^N$
- How many parameters do we need to represent $\mathbf{x}$?
- i.e.\ we want a network $f$ s.t. $f(i)=\mathbf{x}_i$ for $i=0, \ldots, N-1$
- Naively, we would need $O(N)$ parameters
- Say, using a piecewise linear function
- Remarkably, we only need $O(\sqrt{N})$!
${ }^{14}$ Peter Bartlett, Vitaly Maiorov, and Ron Meir. "Almost linear VC dimension bounds for piecewise polynomial networks". In: Advances in Neural Information Processing Systems 11 (1998).
\section{Bit Extraction 2}
- Divide $\{0,1, \ldots, N-1\}$ into $O(\sqrt{N})$ sub-intervals $I_1, \ldots, I_n$ of length $O(\sqrt{N})$
- $I_j=\left\{k_j, k_j+1, \ldots, k_{j+1}-1\right\}$
- Two piecewise linear functions:
- Map $I_j$ to $k_j$
- Map $I_j$ to $b_j=0 . \mathbf{x}_{k_j} \cdots \mathbf{x}_{k_{j+1}-1}$ (the binary expansion of the bits on $I_j$)
- Requires $O(\sqrt{N})$ layers
- Construct network which maps
\begin{align*}
\left(\begin{array}{c}
i \\
k \\
0 . x_1 x_2 \cdots x_n \\
z
\end{array}\right) \rightarrow\left(\begin{array}{c}
i-1 \\
k \\
0 . x_2 \cdots x_n \\
z+x_1 \chi(i=k)
\end{array}\right)
\end{align*}
- Can be done with a constant-size network
- Compose it $O(\sqrt{N})$ times (simulated in the sketch below)
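The following Python sketch (a plain simulation, not a ReLU network) runs this iteration on one sub-interval: starting from $(i, k_j, b_j, 0)$ it shifts off the leading bit at every step, decrements the counter, and adds the bit exactly when the counter equals $k_j$, so the accumulator ends at $\mathbf{x}_i$:
\begin{verbatim}
# Simulation (illustration only) of the bit-extraction iteration above.
def extract_bit(i, k, bits):
    """bits holds (x_k, x_{k+1}, ...) for the sub-interval starting at k."""
    counter, frac, z = i, list(bits), 0
    for _ in range(len(bits)):
        lead = frac.pop(0)              # leading bit of 0.x_t x_{t+1} ...
        z += lead * (counter == k)      # add it only when the counter hits k
        counter -= 1                    # mirrors the map (i, ...) -> (i - 1, ...)
    return z

bits = [1, 0, 1, 1, 0, 1]               # x_k, ..., x_{k+5} on one sub-interval
k = 8
print([extract_bit(i, k, bits) for i in range(k, k + len(bits))])
# -> [1, 0, 1, 1, 0, 1]: each bit x_i is recovered from its index i
\end{verbatim}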
\section{Efficient Representation of Sparse Vectors ${ }^{15}$}
- Existing works use this technique to represent the polynomial coefficient map using $O(\sqrt{N})$ parameters
- Works for Sobolev approximation in the linear regime
- Sobolev approximation in non-linear regime $(p>q)$ requires adaptivity or sparsity
Proposition
Let $M \geq 1$ and $N \geq 1$, and let $\mathbf{x} \in \mathbb{Z}^N$ be an $N$-dimensional vector satisfying
\begin{align*}
\|\mathbf{x}\|_{\ell^1} \leq M
\end{align*}
- Then if $N \geq M$, there exists a neural network $g \in \Upsilon^{17, L}(\mathbb{R}, \mathbb{R})$ with depth $L \leq C \sqrt{M}(1+\log (N / M))$ which satisfies $g(i)=\mathbf{x}_i$ for $i=1, \ldots, N$.
- Further, if $N<M$, then there exists a neural network $g \in \Upsilon^{21, L}(\mathbb{R}, \mathbb{R})$ with depth $L \leq C \sqrt{N}(1+\log (M / N))$ which satisfies $g(i)=\mathbf{x}_i$ for $i=1, \ldots, N$.
${ }^{15}$ Jonathan W. Siegel. "Optimal Approximation Rates for Deep ReLU Neural Networks on Sobolev Spaces". In: arXiv preprint arXiv:2211.14400 (2022).
\section{Main Result: Upper Bounds}
Theorem
Let $\Omega=[0,1]^d$ be the unit cube and let $0<s<\infty$ and $1 \leq q \leq p \leq \infty$. Assume that $1 / q-1 / p<s / d$, which guarantees that we have the compact Sobolev embedding
\begin{align*}
W^s\left(L_q(\Omega)\right) \subset \subset L^p(\Omega) .
\end{align*}
Then there exists an absolute constant $K<\infty$ such that
\begin{align*}
\inf _{f_L \in \Upsilon^{K d, L}\left(\mathbb{R}^d\right)}\left\|f-f_L\right\|_{L_p(\Omega)} \lesssim\|f\|_{W^s\left(L_q(\Omega)\right)} L^{-2 s / d} .
\end{align*}
- We get superconvergence whenever we have a compact Sobolev embedding
${ }^{16}$ Jonathan W. Siegel. "Optimal Approximation Rates for Deep ReLU Neural Networks on Sobolev Spaces". In: arXiv preprint arXiv:2211.14400 (2022).
- Determined sharp approximation rates for deep ReLU networks on Sobolev spaces
- Some open problems:
\begin{itemize}
\item Can we approximate derivatives as well?
\item Super-convergence is inherently unstable (weights are not encodable)
\item Sobolev spaces may be too general; can we show that specific solutions to PDEs are well approximated by deep networks?
\item Extensions to other activation functions and architectures
\item Understanding the optimization and discretization errors as well
\end{itemize}
Notes:
For shallow networks, one can use greedy algorithms.