From 890f6f60a9046d744028e99ed6b111f7990b0c75 Mon Sep 17 00:00:00 2001
From: CarloLucibello
Date: Sun, 27 Jun 2021 19:22:26 +0200
Subject: [PATCH 1/5] fix recurrence docs c

---
 docs/src/models/recurrence.md | 89 ++++++++++++++++++-----------------
 1 file changed, 46 insertions(+), 43 deletions(-)

diff --git a/docs/src/models/recurrence.md b/docs/src/models/recurrence.md
index 245b32d297..5ab7b0506b 100644
--- a/docs/src/models/recurrence.md
+++ b/docs/src/models/recurrence.md
@@ -13,7 +13,7 @@ An aspect to recognize is that in such model, the recurrent cells `A` all refer
 In the most basic RNN case, cell A could be defined by the following:
 
 ```julia
-Wxh = randn(Float32, 5, 2)
+Wxh = randn(Float32, 5, 4)
 Whh = randn(Float32, 5, 5)
 b = randn(Float32, 5)
 
@@ -22,7 +22,7 @@ function rnn_cell(h, x)
   return h, h
 end
 
-x = rand(Float32, 2) # dummy data
+x = rand(Float32, 4) # dummy data
 h = rand(Float32, 5) # initial hidden state
 
 h, y = rnn_cell(h, x)
@@ -37,9 +37,9 @@ There are various recurrent cells available in Flux, notably `RNNCell`, `LSTMCel
 ```julia
 using Flux
 
-rnn = Flux.RNNCell(2, 5)
+rnn = Flux.RNNCell(4, 5)
 
-x = rand(Float32, 2) # dummy data
+x = rand(Float32, 4) # dummy data
 h = rand(Float32, 5) # initial hidden state
 
 h, y = rnn(h, x)
@@ -50,7 +50,7 @@ h, y = rnn(h, x)
 For the most part, we don't want to manage hidden states ourselves, but to treat our models as being stateful. Flux provides the `Recur` wrapper to do this.
 
 ```julia
-x = rand(Float32, 2)
+x = rand(Float32, 4)
 h = rand(Float32, 5)
 
 m = Flux.Recur(rnn, h)
@@ -60,11 +60,11 @@ y = m(x)
 
 The `Recur` wrapper stores the state between runs in the `m.state` field.
 
-If we use the `RNN(2, 5)` constructor – as opposed to `RNNCell` – you'll see that it's simply a wrapped cell.
+If we use the `RNN(4, 5)` constructor – as opposed to `RNNCell` – you'll see that it's simply a wrapped cell.
 
 ```julia
-julia> RNN(2, 5)
-Recur(RNNCell(2, 5, tanh))
+julia> RNN(4, 5)
+Recur(RNNCell(4, 5, tanh))
 ```
 
 Equivalent to the `RNN` stateful constructor, `LSTM` and `GRU` are also available.
@@ -72,101 +72,104 @@ Equivalent to the `RNN` stateful constructor, `LSTM` and `GRU` are also availabl
 Using these tools, we can now build the model shown in the above diagram with:
 
 ```julia
-m = Chain(RNN(2, 5), Dense(5, 1), x -> reshape(x, :))
+m = Chain(RNN(4, 5), Dense(5, 2))
 ```
+In this example, each output has to components.
 
 ## Working with sequences
 
 Using the previously defined `m` recurrent model, we can now apply it to a single step from our sequence:
 
 ```julia
-x = rand(Float32, 2)
+julia> x = rand(Float32, 4);
+
 julia> m(x)
-1-element Array{Float32,1}:
- 0.028398542
+2-element Vector{Float32}:
+ -0.12852919
+ 0.009802654
 ```
 
 The `m(x)` operation would be represented by `x1 -> A -> y1` in our diagram.
-If we perform this operation a second time, it will be equivalent to `x2 -> A -> y2` since the model `m` has stored the state resulting from the `x1` step:
-
-```julia
-x = rand(Float32, 2)
-julia> m(x)
-1-element Array{Float32,1}:
- 0.07381232
-```
+If we perform this operation a second time, it will be equivalent to `x2 -> A -> y2`
+since the model `m` has stored the state resulting from the `x1` step.
 
-Now, instead of computing a single step at a time, we can get the full `y1` to `y3` sequence in a single pass by broadcasting the model on a sequence of data.
+Now, instead of computing a single step at a time, we can get the full `y1` to `y3` sequence in a single pass by
+iterating the model on a sequence of data.
 To do so, we'll need to structure the input data as a `Vector` of observations at each time step. This `Vector` will therefore be of `length = seq_length` and each of its elements will represent the input features for a given step. In our example, this translates into a `Vector` of length 3, where each element is a `Matrix` of size `(features, batch_size)`, or just a `Vector` of length `features` if dealing with a single observation.
 
 ```julia
-x = [rand(Float32, 2) for i = 1:3]
-julia> m.(x)
-3-element Array{Array{Float32,1},1}:
- [-0.17945863]
- [-0.20863166]
- [-0.20693761]
+julia> x = [rand(Float32, 4) for i = 1:3];
+
+julia> [m(x[i]) for i = 1:3]
+3-element Vector{Vector{Float32}}:
+ [-0.018976994, 0.61098206]
+ [-0.8924057, -0.7512169]
+ [-0.34613007, -0.54565114]
 ```
 
 If for some reason one wants to exclude the first step of the RNN chain for the computation of the loss, that can be handled with:
 
 ```julia
 function loss(x, y)
-  sum((Flux.stack(m.(x)[2:end],1) .- y) .^ 2)
+  m(x[1]) # ignores the output but updates the hidden states
+  l = 0f0
+  for i in 2:length(x)
+    l += sum((m(x[i]) .- y[i-1]).^2)
+  end
+  return l
 end
 
-y = rand(Float32, 2)
-julia> loss(x, y)
-1.7021208968648693
+y = [rand(Float32, 2) for i=1:2]
+loss(x, y)
 ```
 
-In such model, only `y2` and `y3` are used to compute the loss, hence the target `y` being of length 2. This is a strategy that can be used to easily handle a `seq-to-one` kind of structure, compared to the `seq-to-seq` assumed so far.
+In such model, only the last two outputs are used to compute the loss, hence the target `y` being of length 2. This is a strategy that can be used to easily handle a `seq-to-one` kind of structure, compared to the `seq-to-seq` assumed so far.
 
 Alternatively, if one wants to perform some warmup of the sequence, it could be performed once, followed with a regular training where all the steps of the sequence would be considered for the gradient update:
 
 ```julia
 function loss(x, y)
-  sum((Flux.stack(m.(x),1) .- y) .^ 2)
+  sum(sum((m(x[i]) .- y[i]).^2) for i=1:length(x))
 end
 
-seq_init = [rand(Float32, 2) for i = 1:1]
-seq_1 = [rand(Float32, 2) for i = 1:3]
-seq_2 = [rand(Float32, 2) for i = 1:3]
+seq_init = [rand(Float32, 4) for i = 1:1]
+seq_1 = [rand(Float32, 4) for i = 1:3]
+seq_2 = [rand(Float32, 4) for i = 1:3]
 
-y1 = rand(Float32, 3)
-y2 = rand(Float32, 3)
+y1 = [rand(Float32, 2) for i = 1:3]
+y2 = [rand(Float32, 2) for i = 1:3]
 
 X = [seq_1, seq_2]
 Y = [y1, y2]
 data = zip(X,Y)
 
 Flux.reset!(m)
-m.(seq_init)
+[m(x) for x in seq_init]
 
 ps = params(m)
 opt= ADAM(1e-3)
 Flux.train!(loss, ps, data, opt)
 ```
 
-In this previous example, model's state is first reset with `Flux.reset!`. Then, there's a warmup that is performed over a sequence of length 1 by feeding it with `seq_init`, resulting in a warmup state. The model can then be trained for 1 epoch, where 2 batches are provided (`seq_1` and `seq_2`) and all the timesteps outputs are considered for the loss (we no longer use a subset of `m.(x)` in the loss function).
+In this previous example, model's state is first reset with `Flux.reset!`. Then, there's a warmup that is performed over a sequence of length 1 by feeding it with `seq_init`, resulting in a warmup state. The model can then be trained for 1 epoch, where 2 batches are provided (`seq_1` and `seq_2`) and all the timesteps outputs are considered for the loss.
 
 In this scenario, it is important to note that a single continuous sequence is considered.
 Since the model state is not reset between the 2 batches, the state of the model flows through the batches, which only makes sense in the context where `seq_1` is the continuation of `seq_init` and so on.
 
 Batch size would be 1 here as there's only a single sequence within each batch. If the model was to be trained on multiple independent sequences, then these sequences could be added to the input data as a second dimension. For example, in a language model, each batch would contain multiple independent sentences. In such scenario, if we set the batch size to 4, a single batch would be of the shape:
 
 ```julia
-batch = [rand(Float32, 2, 4) for i = 1:3]
+batch = [rand(Float32, 4, 4) for i = 1:3]
 ```
 
-That would mean that we have 4 sentences (or samples), each with 2 features (let's say a very small embedding!) and each with a length of 3 (3 words per sentence). Computing `m(batch[1])`, would still represent `x1 -> y1` in our diagram and returns the first word output, but now for each of the 4 independent sentences (second dimension of the input matrix).
+That would mean that we have 4 sentences (or samples), each with 4 features (let's say a very small embedding!) and each with a length of 3 (3 words per sentence). Computing `m(batch[1])`, would still represent `x1 -> y1` in our diagram and returns the first word output, but now for each of the 4 independent sentences (second dimension of the input matrix).
 
 In many situations, such as when dealing with a language model, each batch typically contains independent sentences, so we cannot handle the model as if each batch was the direct continuation of the previous one.
 To handle such situation, we need to reset the state of the model between each batch, which can be conveniently performed within the loss function:
 
 ```julia
 function loss(x, y)
   Flux.reset!(m)
-  sum((Flux.stack(m.(x),1) .- y) .^ 2)
+  sum(sum((m(x[i]) .- y[i]).^2) for i=1:length(x))
 end
 ```

From ab24e8ed9f3722cf12f71c7d2b984ac9587ce096 Mon Sep 17 00:00:00 2001
From: CarloLucibello
Date: Sun, 27 Jun 2021 19:34:19 +0200
Subject: [PATCH 2/5] add warning for map and broadcast

---
 docs/src/models/recurrence.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/docs/src/models/recurrence.md b/docs/src/models/recurrence.md
index 5ab7b0506b..4322587ecb 100644
--- a/docs/src/models/recurrence.md
+++ b/docs/src/models/recurrence.md
@@ -108,6 +108,20 @@ julia> [m(x[i]) for i = 1:3]
  [-0.34613007, -0.54565114]
 ```
 
+!!! warning "Use of map and broadcast"
+    Mapping and broadcasting operations with stateful layers such as the one we are considering are discouraged,
+    since the julia language doesn't guarantee a specific execution order.
+    Therefore, avoid
+    ```julia
+    y = m.(x)
+    # or
+    y = map(m, x)
+    ```
+    and use explicit loops
+    ```julia
+    y = [m(x) for x in x]
+    ```
+
 If for some reason one wants to exclude the first step of the RNN chain for the computation of the loss, that can be handled with:
 
 ```julia

From cea8f752161415de51d971ad9aa9c2be7a4f1e7a Mon Sep 17 00:00:00 2001
From: Carlo Lucibello
Date: Sun, 11 Jul 2021 09:58:09 +0200
Subject: [PATCH 3/5] update docs

---
 docs/src/models/recurrence.md | 47 +++++++++++++++++------------------
 1 file changed, 23 insertions(+), 24 deletions(-)

diff --git a/docs/src/models/recurrence.md b/docs/src/models/recurrence.md
index 4322587ecb..1d6e1276fa 100644
--- a/docs/src/models/recurrence.md
+++ b/docs/src/models/recurrence.md
@@ -13,7 +13,7 @@ An aspect to recognize is that in such model, the recurrent cells `A` all refer
 In the most basic RNN case, cell A could be defined by the following:
 
 ```julia
-Wxh = randn(Float32, 5, 4)
+Wxh = randn(Float32, 5, 2)
 Whh = randn(Float32, 5, 5)
 b = randn(Float32, 5)
 
@@ -22,7 +22,7 @@ function rnn_cell(h, x)
   return h, h
 end
 
-x = rand(Float32, 4) # dummy data
+x = rand(Float32, 2) # dummy data
 h = rand(Float32, 5) # initial hidden state
 
 h, y = rnn_cell(h, x)
@@ -37,9 +37,9 @@ There are various recurrent cells available in Flux, notably `RNNCell`, `LSTMCel
 ```julia
 using Flux
 
-rnn = Flux.RNNCell(4, 5)
+rnn = Flux.RNNCell(2, 5)
 
-x = rand(Float32, 4) # dummy data
+x = rand(Float32, 2) # dummy data
 h = rand(Float32, 5) # initial hidden state
 
 h, y = rnn(h, x)
@@ -50,7 +50,7 @@ h, y = rnn(h, x)
 For the most part, we don't want to manage hidden states ourselves, but to treat our models as being stateful. Flux provides the `Recur` wrapper to do this.
 
 ```julia
-x = rand(Float32, 4)
+x = rand(Float32, 2)
 h = rand(Float32, 5)
 
 m = Flux.Recur(rnn, h)
@@ -60,11 +60,11 @@ y = m(x)
 
 The `Recur` wrapper stores the state between runs in the `m.state` field.
 
-If we use the `RNN(4, 5)` constructor – as opposed to `RNNCell` – you'll see that it's simply a wrapped cell.
+If we use the `RNN(2, 5)` constructor – as opposed to `RNNCell` – you'll see that it's simply a wrapped cell.
 
 ```julia
-julia> RNN(4, 5)
-Recur(RNNCell(4, 5, tanh))
+julia> RNN(2, 5)
+Recur(RNNCell(2, 5, tanh))
 ```
 
 Equivalent to the `RNN` stateful constructor, `LSTM` and `GRU` are also available.
@@ -72,7 +72,7 @@ Equivalent to the `RNN` stateful constructor, `LSTM` and `GRU` are also availabl
 Using these tools, we can now build the model shown in the above diagram with:
 
 ```julia
-m = Chain(RNN(4, 5), Dense(5, 2))
+m = Chain(RNN(2, 5), Dense(5, 2))
 ```
 In this example, each output has to components.
 
@@ -81,7 +81,7 @@ In this example, each output has to components.
 Using the previously defined `m` recurrent model, we can now apply it to a single step from our sequence:
 
 ```julia
-julia> x = rand(Float32, 4);
+julia> x = rand(Float32, 2);
 
 julia> m(x)
 2-element Vector{Float32}:
@@ -99,9 +99,9 @@ iterating the model on a sequence of data.
 To do so, we'll need to structure the input data as a `Vector` of observations at each time step. This `Vector` will therefore be of `length = seq_length` and each of its elements will represent the input features for a given step. In our example, this translates into a `Vector` of length 3, where each element is a `Matrix` of size `(features, batch_size)`, or just a `Vector` of length `features` if dealing with a single observation.
 
 ```julia
-julia> x = [rand(Float32, 4) for i = 1:3];
+julia> x = [rand(Float32, 2) for i = 1:3];
 
-julia> [m(x[i]) for i = 1:3]
+julia> [m(xi) for xi in x]
 3-element Vector{Vector{Float32}}:
  [-0.018976994, 0.61098206]
  [-0.8924057, -0.7512169]
@@ -109,7 +109,7 @@ julia> [m(x[i]) for i = 1:3]
  [-0.34613007, -0.54565114]
 ```
 
 !!! warning "Use of map and broadcast"
-    Mapping and broadcasting operations with stateful layers such as the one we are considering are discouraged,
+    Mapping and broadcasting operations with stateful layers are discouraged,
     since the julia language doesn't guarantee a specific execution order.
     Therefore, avoid
     ```julia
@@ -125,12 +125,11 @@ julia> [m(x[i]) for i = 1:3]
 If for some reason one wants to exclude the first step of the RNN chain for the computation of the loss, that can be handled with:
 
 ```julia
+using Flux.Losses: mse
+
 function loss(x, y)
   m(x[1]) # ignores the output but updates the hidden states
-  l = 0f0
-  for i in 2:length(x)
-    l += sum((m(x[i]) .- y[i-1]).^2)
-  end
+  l = sum(mse(m(xi), yi) for (xi, yi) in zip(x[2:end], y))
   return l
 end
 
@@ -144,12 +143,12 @@ Alternatively, if one wants to perform some warmup of the sequence, it could be
 
 ```julia
 function loss(x, y)
-  sum(sum((m(x[i]) .- y[i]).^2) for i=1:length(x))
+  sum(mse(m(xi), yi) for (xi, yi) in zip(x, y))
 end
 
-seq_init = [rand(Float32, 4) for i = 1:1]
-seq_1 = [rand(Float32, 4) for i = 1:3]
-seq_2 = [rand(Float32, 4) for i = 1:3]
+seq_init = [rand(Float32, 2)]
+seq_1 = [rand(Float32, 2) for i = 1:3]
+seq_2 = [rand(Float32, 2) for i = 1:3]
 
 y1 = [rand(Float32, 2) for i = 1:3]
 y2 = [rand(Float32, 2) for i = 1:3]
@@ -172,17 +172,17 @@ In this scenario, it is important to note that a single continuous sequence is c
 Batch size would be 1 here as there's only a single sequence within each batch. If the model was to be trained on multiple independent sequences, then these sequences could be added to the input data as a second dimension. For example, in a language model, each batch would contain multiple independent sentences. In such scenario, if we set the batch size to 4, a single batch would be of the shape:
 
 ```julia
-batch = [rand(Float32, 4, 4) for i = 1:3]
+batch = [rand(Float32, 2, 4) for i = 1:3]
 ```
 
-That would mean that we have 4 sentences (or samples), each with 4 features (let's say a very small embedding!) and each with a length of 3 (3 words per sentence). Computing `m(batch[1])`, would still represent `x1 -> y1` in our diagram and returns the first word output, but now for each of the 4 independent sentences (second dimension of the input matrix).
+That would mean that we have 4 sentences (or samples), each with 2 features (let's say a very small embedding!) and each with a length of 3 (3 words per sentence). Computing `m(batch[1])`, would still represent `x1 -> y1` in our diagram and returns the first word output, but now for each of the 4 independent sentences (second dimension of the input matrix).
 
 In many situations, such as when dealing with a language model, each batch typically contains independent sentences, so we cannot handle the model as if each batch was the direct continuation of the previous one.
 To handle such situation, we need to reset the state of the model between each batch, which can be conveniently performed within the loss function:
 
 ```julia
 function loss(x, y)
   Flux.reset!(m)
-  sum(sum((m(x[i]) .- y[i]).^2) for i=1:length(x))
+  sum(mse(m(xi), yi) for (xi, yi) in zip(x, y))
 end
 ```

From 300f5c236fcfbb7f1acbe832182e3891f187d2f5 Mon Sep 17 00:00:00 2001
From: Carlo Lucibello
Date: Sun, 11 Jul 2021 10:06:03 +0200
Subject: [PATCH 4/5] output size 1

---
 docs/src/models/recurrence.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/src/models/recurrence.md b/docs/src/models/recurrence.md
index 1d6e1276fa..28fc34140a 100644
--- a/docs/src/models/recurrence.md
+++ b/docs/src/models/recurrence.md
@@ -72,9 +72,9 @@ Equivalent to the `RNN` stateful constructor, `LSTM` and `GRU` are also availabl
 Using these tools, we can now build the model shown in the above diagram with:
 
 ```julia
-m = Chain(RNN(2, 5), Dense(5, 2))
+m = Chain(RNN(2, 5), Dense(5, 1))
 ```
-In this example, each output has to components.
+In this example, each output has two components.
 
 ## Working with sequences
 
@@ -129,15 +129,14 @@ using Flux.Losses: mse
 
 function loss(x, y)
   m(x[1]) # ignores the output but updates the hidden states
-  l = sum(mse(m(xi), yi) for (xi, yi) in zip(x[2:end], y))
-  return l
+  sum(mse(m(xi), yi) for (xi, yi) in zip(x[2:end], y))
 end
 
-y = [rand(Float32, 2) for i=1:2]
+y = [rand(Float32, 1) for i=1:2]
 loss(x, y)
 ```
 
-In such model, only the last two outputs are used to compute the loss, hence the target `y` being of length 2. This is a strategy that can be used to easily handle a `seq-to-one` kind of structure, compared to the `seq-to-seq` assumed so far.
+In such a model, only the last two outputs are used to compute the loss, hence the target `y` being of length 2. This is a strategy that can be used to easily handle a `seq-to-one` kind of structure, compared to the `seq-to-seq` assumed so far.
 
 Alternatively, if one wants to perform some warmup of the sequence, it could be performed once, followed with a regular training where all the steps of the sequence would be considered for the gradient update:
 
@@ -150,8 +149,8 @@ seq_init = [rand(Float32, 2)]
 seq_1 = [rand(Float32, 2) for i = 1:3]
 seq_2 = [rand(Float32, 2) for i = 1:3]
 
-y1 = [rand(Float32, 2) for i = 1:3]
-y2 = [rand(Float32, 2) for i = 1:3]
+y1 = [rand(Float32, 1) for i = 1:3]
+y2 = [rand(Float32, 1) for i = 1:3]
 
 X = [seq_1, seq_2]
 Y = [y1, y2]
@@ -171,7 +171,8 @@ In this scenario, it is important to note that a single continuous sequence is c
 Batch size would be 1 here as there's only a single sequence within each batch. If the model was to be trained on multiple independent sequences, then these sequences could be added to the input data as a second dimension. For example, in a language model, each batch would contain multiple independent sentences. In such scenario, if we set the batch size to 4, a single batch would be of the shape:
 
 ```julia
-batch = [rand(Float32, 2, 4) for i = 1:3]
+x = [rand(Float32, 2, 4) for i = 1:3]
+y = [rand(Float32, 1, 4) for i = 1:3]
 ```
 
 That would mean that we have 4 sentences (or samples), each with 2 features (let's say a very small embedding!) and each with a length of 3 (3 words per sentence). Computing `m(batch[1])`, would still represent `x1 -> y1` in our diagram and returns the first word output, but now for each of the 4 independent sentences (second dimension of the input matrix).
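As a quick illustration of the batched layout the patched text above describes (3 time steps, 4 independent sentences, 2 input features, 1 output), here is a minimal sketch; the model, variable names, and sizes are illustrative assumptions chosen to match the surrounding examples, not part of the patches:

```julia
using Flux

# Illustrative sketch only: batched sequence data as described in the docs above.
m = Chain(RNN(2, 5), Dense(5, 1))        # assumed model, matching the patched docs

x = [rand(Float32, 2, 4) for i = 1:3]    # one (features, batch_size) matrix per time step
y = [rand(Float32, 1, 4) for i = 1:3]    # matching (output_size, batch_size) targets

size(m(x[1]))                            # (1, 4): the first-step output for all 4 sentences at once
```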
From 010c0bb95b327140be1616c242dbcd61d3567722 Mon Sep 17 00:00:00 2001
From: Carlo Lucibello
Date: Sun, 11 Jul 2021 10:09:09 +0200
Subject: [PATCH 5/5] fix doc

---
 docs/src/models/recurrence.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/models/recurrence.md b/docs/src/models/recurrence.md
index 28fc34140a..ba8756e64f 100644
--- a/docs/src/models/recurrence.md
+++ b/docs/src/models/recurrence.md
@@ -74,7 +74,7 @@ Using these tools, we can now build the model shown in the above diagram with:
 ```julia
 m = Chain(RNN(2, 5), Dense(5, 1))
 ```
-In this example, each output has two components.
+In this example, each output has only one component.
 
 ## Working with sequences
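Putting the pieces together, a minimal end-to-end sketch of the training pattern the patched docs converge on (reset the state inside the loss and iterate the time steps explicitly); the data, sizes, and optimiser settings here are illustrative assumptions, not part of the patches:

```julia
using Flux
using Flux.Losses: mse

m = Chain(RNN(2, 5), Dense(5, 1))

function loss(x, y)
  Flux.reset!(m)                                   # batches hold independent sequences
  sum(mse(m(xi), yi) for (xi, yi) in zip(x, y))    # iterate the time steps explicitly
end

# Assumed toy data: 10 batches, each with 3 time steps, batch size 4, 2 features.
X = [[rand(Float32, 2, 4) for step = 1:3] for batch = 1:10]
Y = [[rand(Float32, 1, 4) for step = 1:3] for batch = 1:10]

Flux.train!(loss, params(m), zip(X, Y), ADAM(1e-3))
```

Resetting inside the loss mirrors the final hunk of patch 3, so every batch starts from a fresh hidden state.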