GeLU, or Gaussian Error Linear Unit, is an alternative to ReLU. It was introduced by Hendrycks and Gimpel (2016) and used in the BERT paper. GeLU is defined as:

$$\mathrm{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x) = P(X \le x)$, $X \sim \mathcal{N}(0, 1)$, is the cumulative distribution function of the standard normal distribution.
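To make the definition concrete, here is a minimal Python sketch (the function name `gelu` is mine, not from the paper) that computes the exact form via the error function, using $\Phi(x) = \frac{1}{2}\left(1 + \mathrm{erf}(x / \sqrt{2})\right)$:

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(gelu(-3.0))  # ~ -0.004: almost zero for large negative inputs
print(gelu(3.0))   # ~  2.996: almost the identity for large positive inputs
print(gelu(0.0))   #    0.0:   smooth transition near zero
```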
The authors' reasoning was to combine dropout (stochastically multiply the input by one or zero) and ReLU (multiply the input by one or zero depending on the input's value). So they used a Bernoulli distribution to sample a mask $m \sim \mathrm{Bernoulli}(\Phi(x))$ and multiply the input by it, which makes the activation:
- almost zero for large negative inputs
- almost linear for large positive inputs
- smooth transition near 0.
However, to avoid sampling a random number at every forward pass, they instead used the expected value of this stochastic transformation:

$$\mathbb{E}[x \cdot m] = x \cdot P(m = 1) = x \cdot \Phi(x)$$

which is exactly the deterministic GeLU defined above.
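As a quick sanity check, here is a small Monte Carlo sketch (the names `phi` and `stochastic_gelu` are mine) showing the sampled Bernoulli version averaging out to the deterministic GeLU:

```python
import math
import random

def phi(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def stochastic_gelu(x: float) -> float:
    """One sample of the stochastic version: x * m, with m ~ Bernoulli(Phi(x))."""
    m = 1 if random.random() < phi(x) else 0
    return x * m

x = 0.5
n = 100_000
mc_estimate = sum(stochastic_gelu(x) for _ in range(n)) / n
exact = x * phi(x)          # the expected value, i.e. the deterministic GeLU
print(mc_estimate, exact)   # the two should agree to ~2 decimal places
```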