v7.3.0: Mish activation and experimental optimizers
✨ New features and improvements
- Add Mish activation. Use it via the `thinc.v2v.Mish` layer, which computes `f(X) = mish(W @ X + b)`. CUDA and Cython kernels are included to make the activation efficient. A usage sketch follows this list.
- Add experimental support for RAdam to the optimizer. Enable it by setting the keyword argument `use_radam` to `True`. In preliminary testing, it's a small change that's worth enabling. The optimizer sketch after this list shows the new keyword arguments.
- Add experimental support for Lookahead to the optimizer. Enable it by setting the keyword argument `lookahead_k` to a positive integer. In preliminary testing, it helps if you're not using parameter averaging, but with averaging it's a bit worse.
- Add experimental support for LARS to the optimizer. Enable it by setting `use_lars` to `True`. In preliminary testing, this hasn't worked well at all – possibly our implementation is broken.
🙏 Acknowledgements
Big thanks to @digantamisra98 for the Mish activation, especially the extensive experiments and simple gradient calculation. We expect to be using the activation in the next round of spaCy models.
Gratitude to the fast.ai community for their crowd-sourced experiments, and especially to users @lessw2020, @mgrankin and others for their optimizer implementations, which we referenced heavily when implementing the optimizers for Thinc. More importantly, it's super helpful to have a community filtering the deluge of papers for techniques that work on a few different datasets. This thread on optimization research was particularly helpful.