doc entry for MultiHeadAttention #2218

Merged · 5 commits · Mar 23, 2023
1 change: 1 addition & 0 deletions .gitignore
@@ -8,3 +8,4 @@ deps
.vscode
Manifest.toml
LocalPreferences.toml
.DS_Store
8 changes: 5 additions & 3 deletions NEWS.md
@@ -1,14 +1,16 @@
# Flux Release Notes

## v0.13.15
* Added [MultiHeadAttention](https://github.com/FluxML/Flux.jl/pull/2146) layer.

## v0.13.14
* Fixed various deprecation warnings, from `Zygote.@nograd` and `Vararg`.
* Initial support for `AMDGPU` via extension mechanism.
* Add `gpu_backend` preference to select GPU backend using `LocalPreference.toml`.
* Add `Flux.gpu_backend!` method to switch between GPU backends.

## v0.13.13
* Added `f16` which changes precision to `Float16`, recursively.
* Initial support for AMDGPU via extension mechanism.
* Add `gpu_backend` preference to select GPU backend using `LocalPreference.toml`.
* Add `Flux.gpu_backend!` method to switch between GPU backends.
Comment on lines -9 to -11 (Member Author): these were misplaced

## v0.13.12
* CUDA.jl 4.0 compatibility.
11 changes: 10 additions & 1 deletion docs/src/models/layers.md
@@ -10,7 +10,7 @@ The `Dense` exemplifies several features:

* It takes an `init` keyword, which accepts a function acting like `rand`. That is, `init(2,3,4)` should create an array of this size. Flux has [many such functions](@ref man-init-funcs) built-in. All make a CPU array, moved later with [`gpu`](@ref Flux.gpu) if desired.

- * The bias vector is always intialised [`Flux.zeros32`](@ref). The keyword `bias=false` will turn this off, i.e. keeping the bias permanently zero.
+ * The bias vector is always initialised [`Flux.zeros32`](@ref). The keyword `bias=false` will turn this off, i.e. keeping the bias permanently zero.

* It is annotated with [`@functor`](@ref Functors.@functor), which means that [`params`](@ref Flux.params) will see the contents, and [`gpu`](@ref Flux.gpu) will move their arrays to the GPU.

@@ -54,6 +54,15 @@ SamePad
Flux.flatten
```

## MultiHeadAttention

The basic blocks needed to implement [Transformer](https://arxiv.org/abs/1706.03762) architectures. See also the functional counterparts
documented in NNlib's [Attention](@ref) section.

```@docs
MultiHeadAttention
```
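For orientation, a minimal usage sketch of the layer documented above; the array sizes are made up, and the `nheads` keyword and the `(output, attention_scores)` return follow the docstring added in this PR:

```julia
using Flux

# Illustrative sizes: embedding dimension 64, 8 heads, sequence length 10, batch of 32.
mha = MultiHeadAttention(64; nheads = 8)
q = rand(Float32, 64, 10, 32)   # (embed_dim, seq_len, batch)
y, α = mha(q)                   # self-attention: key and value default to the query
size(y)                         # (64, 10, 32)
size(α)                         # (10, 10, 8, 32), one attention map per head
```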

### Pooling

These layers are commonly used after a convolution layer, and reduce the size of its output. They have no trainable parameters.
24 changes: 18 additions & 6 deletions docs/src/models/nnlib.md
@@ -2,9 +2,20 @@

Flux re-exports all of the functions exported by the [NNlib](https://github.com/FluxML/NNlib.jl) package. This includes activation functions, described on [their own page](@ref man-activations). Many of the functions on this page exist primarily as the internal implementation of Flux layers, but can also be used independently.


## Attention

Primitives for the [`MultiHeadAttention`](@ref) layer.

```@docs
NNlib.dot_product_attention
NNlib.dot_product_attention_scores
NNlib.make_causal_mask
```
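A rough sketch of how these primitives combine, using toy shapes and the default of a single head:

```julia
using NNlib

q = rand(Float32, 8, 6, 2)   # (feature_dim, seq_len, batch)
k = rand(Float32, 8, 6, 2)
v = rand(Float32, 8, 6, 2)

# Boolean mask so each position attends only to itself and earlier positions.
mask = NNlib.make_causal_mask(q)
y, α = NNlib.dot_product_attention(q, k, v; mask = mask)
size(y)   # (8, 6, 2)
size(α)   # (6, 6, 1, 2)
```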

## Softmax

- `Flux`'s `logitcrossentropy` uses `NNlib.softmax` internally.
+ `Flux`'s [`Flux.logitcrossentropy`](@ref) uses [`NNlib.logsoftmax`](@ref) internally.
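For reference, a quick illustration of that relationship, with arbitrary sizes:

```julia
using NNlib

x = randn(Float32, 5, 3)           # 5 classes, 3 samples
softmax(x; dims = 1)               # each column sums to 1
logsoftmax(x) ≈ log.(softmax(x))   # true, but computed in a numerically safer way
```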

```@docs
softmax
@@ -13,7 +24,8 @@ logsoftmax

## Pooling

- `Flux`'s `AdaptiveMaxPool`, `AdaptiveMeanPool`, `GlobalMaxPool`, `GlobalMeanPool`, `MaxPool`, and `MeanPool` use `NNlib.PoolDims`, `NNlib.maxpool`, and `NNlib.meanpool` as their backend.
+ `Flux`'s [`AdaptiveMaxPool`](@ref), [`AdaptiveMeanPool`](@ref), [`GlobalMaxPool`](@ref), [`GlobalMeanPool`](@ref),
+ [`MaxPool`](@ref), and [`MeanPool`](@ref) use [`NNlib.PoolDims`](@ref), [`NNlib.maxpool`](@ref), and [`NNlib.meanpool`](@ref) as their backend.
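For example, the layer and the underlying kernel agree on a small input; the sizes below are chosen only for illustration:

```julia
using Flux, NNlib

x = rand(Float32, 8, 8, 3, 1)                    # a WHCN image batch
MaxPool((2, 2))(x) == NNlib.maxpool(x, (2, 2))   # the layer defers to the NNlib kernel
size(NNlib.maxpool(x, (2, 2)))                   # (4, 4, 3, 1)
```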

```@docs
PoolDims
@@ -32,7 +44,7 @@ pad_zeros

## Convolution

- `Flux`'s `Conv` and `CrossCor` layers use `NNlib.DenseConvDims` and `NNlib.conv` internally.
+ `Flux`'s [`Conv`](@ref) and [`CrossCor`](@ref) layers use [`NNlib.DenseConvDims`](@ref) and [`NNlib.conv`](@ref) internally.
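A sketch of the low-level API those layers build on; the kernel and input sizes are arbitrary:

```julia
using NNlib

x = rand(Float32, 28, 28, 1, 16)           # WHCN input batch
w = rand(Float32, 3, 3, 1, 4)              # 3×3 kernel, 1 input channel, 4 output channels
cdims = DenseConvDims(x, w; padding = 1)   # geometry: shapes, stride, padding, dilation
y = conv(x, w, cdims)                      # size (28, 28, 4, 16) with this padding
```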

```@docs
conv
@@ -44,7 +56,7 @@ DenseConvDims

## Upsampling

- `Flux`'s `Upsample` layer uses `NNlib.upsample_nearest`, `NNlib.upsample_bilinear`, and `NNlib.upsample_trilinear` as its backend. Additionally, `Flux`'s `PixelShuffle` layer uses `NNlib.pixel_shuffle` as its backend.
+ `Flux`'s [`Upsample`](@ref) layer uses [`NNlib.upsample_nearest`](@ref), [`NNlib.upsample_bilinear`](@ref), and [`NNlib.upsample_trilinear`](@ref) as its backend. Additionally, `Flux`'s [`PixelShuffle`](@ref) layer uses [`NNlib.pixel_shuffle`](@ref) as its backend.
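For instance, with an input size chosen only for illustration:

```julia
using NNlib

x = rand(Float32, 4, 4, 3, 1)         # small WHCN input
size(upsample_nearest(x, (2, 2)))     # (8, 8, 3, 1), each pixel repeated
size(upsample_bilinear(x, (2, 2)))    # (8, 8, 3, 1), interpolated instead
```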

```@docs
upsample_nearest
@@ -60,7 +72,7 @@ pixel_shuffle

## Batched Operations

- `Flux`'s `Bilinear` layer uses `NNlib.batched_mul` internally.
+ `Flux`'s [`Flux.Bilinear`](@ref) layer uses [`NNlib.batched_mul`](@ref) internally.
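A small sketch of what `batched_mul` computes:

```julia
using NNlib

A = rand(Float32, 2, 3, 10)            # a batch of ten 2×3 matrices
B = rand(Float32, 3, 4, 10)            # a matching batch of 3×4 matrices
C = batched_mul(A, B)                  # size (2, 4, 10): one matrix product per batch slice
C[:, :, 1] ≈ A[:, :, 1] * B[:, :, 1]   # true
```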

```@docs
batched_mul
@@ -72,7 +84,7 @@ batched_vec

## Gather and Scatter

- `Flux`'s `Embedding` layer uses `NNlib.gather` as its backend.
+ `Flux`'s [`Embedding`](@ref) layer uses [`NNlib.gather`](@ref) as its backend.
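For example, an embedding lookup is a gather of columns; the sizes here are illustrative:

```julia
using NNlib

emb = rand(Float32, 8, 100)   # an 8-dimensional embedding for a vocabulary of 100 tokens
idx = [5, 17, 17, 42]         # token indices to look up
NNlib.gather(emb, idx)        # size (8, 4): the columns of `emb` selected by `idx`
```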

```@docs
NNlib.gather
2 changes: 1 addition & 1 deletion docs/src/tutorials/2021-10-08-dcgan-mnist.md
@@ -101,7 +101,7 @@ We will be using the [relu](https://fluxml.ai/Flux.jl/stable/models/nnlib/#NNlib
We will also apply the weight initialization method mentioned in the original DCGAN paper.

```julia
- # Function for intializing the model weights with values
+ # Function for initializing the model weights with values
# sampled from a Gaussian distribution with μ=0 and σ=0.02
dcgan_init(shape...) = randn(Float32, shape) * 0.02f0
```
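For context, such an initializer is passed to a layer through its `init` keyword. The layer below is a made-up example, not part of the tutorial's model:

```julia
using Flux

# Hypothetical layer using the custom initializer defined above.
Conv((4, 4), 1 => 64; stride = 2, pad = 1, init = dcgan_init)
```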