Update to explicit Flux, small fix for arrays.md #21

Merged (1 commit, Nov 27, 2023)
2 changes: 1 addition & 1 deletion docs/src/lecture_02/arrays.md
@@ -448,7 +448,7 @@ ERROR: ArgumentError: number of columns of each array must match (got (4, 1))
<div class="admonition-body">
```

Create two vectors: vector of all odd positive integers smaller than `10` and vector of all even positive integers smaller than `10`. Then concatenate these two vectors horizontally and fill the third row with `4`.
Create two vectors: vector of all odd positive integers smaller than `10` and vector of all even positive integers smaller than or equal to `10`. Then concatenate these two vectors horizontally and fill the third row with `4`.
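
One possible solution of this exercise (a sketch added here for reference, not part of the lecture source; the variable names are illustrative):

```julia
# Sketch: build the two vectors, concatenate them horizontally,
# and overwrite the third row with 4.
odds  = collect(1:2:9)    # odd positive integers smaller than 10
evens = collect(2:2:10)   # even positive integers smaller than or equal to 10
A = hcat(odds, evens)     # 5×2 matrix
A[3, :] .= 4              # fill the third row with 4
A
```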

```@raw html
</div></div>
50 changes: 50 additions & 0 deletions docs/src/lecture_11/Iris_train_test_acc.svg
46 changes: 25 additions & 21 deletions docs/src/lecture_11/iris.md
@@ -41,8 +41,8 @@ using Flux

n_hidden = 5
m = Chain(
Dense(size(X_train,1), n_hidden, relu),
Dense(n_hidden, size(y_train,1), identity),
Dense(size(X_train,1) => n_hidden, relu),
Dense(n_hidden => size(y_train,1), identity),
softmax,
)

@@ -59,7 +59,7 @@ m(X_train)

Because there are ``3`` classes and ``120`` samples in the training set, it returns an array of size ``3\times 120``. Each column corresponds to one sample and forms a vector of probabilities due to the last layer of softmax.
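
As a quick sanity check (a sketch, not part of the original text), we can verify this shape and that each column sums to one:

```julia
# Sketch: every column of the softmax output is a probability
# distribution over the 3 classes, so it sums to (roughly) one.
ŷ = m(X_train)
size(ŷ)                                   # (3, 120)
all(s -> isapprox(s, 1), sum(ŷ; dims=1))  # true
```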

We access the neural network parameters by using `params(m)`. We can select the second layer of `m` by `m[2]`. Since the second layer has ``5 `` input and ``3`` output neurons, its parameters are a matrix of size ``3\times 5`` and a vector of length ``3``. The parameters `params(m[2])` are a tuple of the matrix and the vector. This also implies that the parameters are initialized randomly, and we do not need to take care of it. We can easily modify any parameters.
We access the neural network parameters by using `params(m)`. We can select the second layer of `m` by `m[2]`. Since the second layer has ``5`` input and ``3`` output neurons, its parameters are a matrix of size ``3\times 5`` and a vector of length ``3``. The parameters `params(m[2])` are a tuple of the matrix and the vector. The parameters are initialized randomly, so we do not need to set them ourselves. We can also easily modify any parameters.

```@example iris
using Flux: params
@@ -76,17 +76,19 @@ To train the network, we need to define the objective function ``L``. Since we a
```@example iris
using Flux: crossentropy

L(x,y) = crossentropy(m(x), y)
L(ŷ, y) = crossentropy(ŷ, y)

nothing # hide
```

The `loss` function does not have `m` as input. Even though there could be an additional input parameter, it is customary to write it without it. We can evaluate the objective function by
The `loss` function should be defined between the predicted label $\hat{y}$ and the true label $y$. Therefore, we can evaluate the objective function by

```@example iris
L(X_train, y_train)
L(m(X_train), y_train)
```

where `ŷ = m(x)`.

This computes the objective function on the whole training set. Since Flux is (unlike our implementation from the last lecture) smart, there is no need to take care of individual samples.
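
A sketch of what this means in practice (not part of the lecture source): `crossentropy` averages the per-sample losses by default, so the value on the full training set equals the mean of the per-column losses.

```julia
# Sketch: the loss on the whole set matches the mean of the per-sample
# losses, because Flux.crossentropy aggregates with agg = mean by default.
using Statistics: mean

per_sample = [Flux.crossentropy(m(X_train[:, i:i]), y_train[:, i:i]) for i in axes(X_train, 2)]
isapprox(mean(per_sample), L(m(X_train), y_train))  # true up to numerical error
```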

!!! info "Notation:"
@@ -95,46 +97,48 @@ This computes the objective function on the whole training set. Since Flux is (u
Since we have the model and the loss function, the only remaining thing is the gradient. Flux again provides a smart way to compute it.

```@example iris
ps = params(m)
grad = gradient(() -> L(X_train, y_train), ps)
grads = Flux.gradient(m -> L(m(X_train), y_train), m)

nothing # hide
```

The function `gradient` takes two inputs. The first one is the function we want to differentiate, and the second one are the parameters. The `L` function needs to be evaluated at the correct points `X_train` and `y_train`. In some applications, we may need to differentiate with respect to other parameters such as `X_train`. This can be achieved by changing the second parameters of the `gradient` function.
The function `gradient` takes as inputs a function to differentiate and the arguments with respect to which we want to differentiate. Since the argument is the model `m` itself, the gradient is taken with respect to the parameters of `m`. The `L` function needs to be evaluated at the correct points `m(X_train)` (the predictions) and `y_train` (the true labels).

```@example iris
grad = gradient(() -> L(X_train, y_train), params(X_train))
The `grads` structure is a tuple holding a named tuple with the `:layers` key. Each layer entry then holds the gradients for that layer's parameters, in this case the weights $W$, the bias $b$, and optionally the parameters of the activation function $\sigma$.

size(grad[X_train])
```julia
julia> grads[1][:layers][2]
(weight = Float32[0.30140522 0.007785671 … -0.070617765 0.014230583; 0.06814249 -0.07018863 … 0.17996183 -0.20995824; -0.36954764 0.062402964 … -0.10934405 0.19572766], bias = Float32[0.0154182855, 0.022615476, -0.03803377], σ = nothing)
```

Since `X_train` has shape ``4\times 120``, the gradient needs to have the same size.
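
With the explicit API, differentiating with respect to the input data instead of the model is analogous; a sketch (not part of the PR):

```julia
# Sketch: take the gradient with respect to X_train by making it the
# differentiated argument; the result has the same shape as X_train.
grad_X = Flux.gradient(X -> L(m(X), y_train), X_train)
size(grad_X[1])  # (4, 120)
```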

We train the classifiers for 250 iterations. In each iteration, we compute the gradient with respect to all network parameters and perform the gradient descent with stepsize ``0.1``.
Now, we train the classifier for 250 iterations. In each iteration, we compute the gradient with respect to all network parameters and perform gradient descent with stepsize ``0.1``. Recent versions of Flux replaced the implicit definition of optimisers with an explicit one: we now need to call `Flux.setup(optimiser, model)` to create an optimiser state tied to the model's parameters.

```@example iris
opt = Descent(0.1)
opt_state = Flux.setup(opt, m)
max_iter = 250

acc_train = zeros(max_iter)
acc_test = zeros(max_iter)
for i in 1:max_iter
gs = gradient(() -> L(X_train, y_train), ps)
Flux.Optimise.update!(opt, ps, gs)
gs = Flux.gradient(m -> L(m(X_train), y_train), m)
Flux.update!(opt_state, m, gs[1])
acc_train[i] = accuracy(X_train, y_train)
acc_test[i] = accuracy(X_test, y_test)
end

nothing # hide
```
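
The `accuracy` helper used above is defined in an earlier, collapsed part of the lecture. A typical definition consistent with these calls might look as follows (an assumption, not shown in this diff):

```julia
# Sketch of an accuracy function matching the calls above;
# onecold maps probability/one-hot columns back to class indices.
using Statistics: mean
using Flux: onecold

accuracy(X, y) = mean(onecold(m(X)) .== onecold(y))
```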

The accuracy on the testing set keeps increasing as the training progresses.
The accuracy on both the training and testing sets keeps increasing as the training progresses. This is a good indication that we are not over-fitting.

```@example iris
using Plots

plot(acc_test, xlabel="Iteration", ylabel="Test accuracy", label="", ylim=(-0.01,1.01))
plot(acc_train, xlabel="Iteration", ylabel="Accuracy", label="train", ylim=(-0.01,1.01))
plot!(acc_test, xlabel="Iteration", label="test", ylim=(-0.01,1.01))

savefig("Iris_acc.svg") # hide
savefig("Iris_train_test_acc.svg") # hide
```

![](Iris_acc.svg)
![](Iris_train_test_acc.svg)
20 changes: 12 additions & 8 deletions docs/src/lecture_11/nn.md
@@ -21,7 +21,7 @@ This lecture shows how to train more complex networks using stochastic gradient

## Preparing data

During the last lecture, we implemented everything from scratch. This lecture will introduce the package [Flux](https://fluxml.ai/Flux.jl/stable/models/basics/) which automizes most of the things needed for neural networks.
During the last lecture, we implemented everything from scratch. This lecture will introduce the package [Flux](https://fluxml.ai/Flux.jl/stable/models/basics/) (and [Optimisers](https://fluxml.ai/Optimisers.jl/stable/)), which automates most of the things needed for neural networks.
- It creates many layers, including convolutional layers.
- It creates the model by chaining layers together.
- It efficiently represents model parameters.
@@ -381,35 +381,39 @@ m = Chain(
nothing # hide
```

The objective function ``L`` then applies the cross-entropy loss to the predictions and labels.
The objective function ``L`` then applies the cross-entropy loss to the predictions and labels. To be able to use the `Flux.train!` function to easily train the neural network, we define the loss ``L`` with the model as its first argument:

```@example nn
using Flux: crossentropy

L(X, y) = crossentropy(m(X), y)
L(model, X, y) = crossentropy(model(X), y)

nothing # hide
```

We now write the function `train_model!` to train the neural network `m`. Since this function modifies the input model `m`, its name should end with an exclamation mark. Besides the loss function `L`, the data `X`, and the labels `y`, it also takes as keyword arguments the optimiser `opt`, the minibatch size `batchsize`, the number of epochs `n_epochs`, and the file name `file_name` to which the model should be saved.

!!! info "Optimiser and optimiser state:"
    Note that we have to initialize the optimiser state `opt_state`. For plain gradient descent `Descent(learning_rate)`, the optimiser has no internal state or parameters. However, for parametrized optimisers such as Adam, the internal state in `opt_state` is updated in each iteration, just like the parameters of the model. Therefore, if we want to save a model and continue its training later on, we need to save both the model (or its parameters) and the optimiser state.
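
A small sketch of the difference (added for illustration only):

```julia
# Sketch: Descent keeps no per-parameter state, while Adam tracks moment
# estimates for every parameter; its setup state must therefore be saved
# together with the model to resume training later.
opt_state_descent = Flux.setup(Descent(0.1), m)
opt_state_adam    = Flux.setup(Adam(1.0f-3), m)
```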


```@example nn
using BSON
using Flux: params

function train_model!(m, L, X, y;
opt = Descent(0.1),
batchsize = 128,
n_epochs = 10,
file_name = "")

opt_state = Flux.setup(opt, m)
batches = DataLoader((X, y); batchsize, shuffle = true)

for _ in 1:n_epochs
Flux.train!(L, params(m), batches, opt)
Flux.train!(L, m, batches, opt_state)
end

!isempty(file_name) && BSON.bson(file_name, m=m)
!isempty(file_name) && BSON.bson(file_name, m=m, opt_state=opt_state)

return
end
@@ -498,7 +502,7 @@ Use this function to load the model from `data/mnist.bson` and evaluate the perf

The optional arguments should contain `kwargs...`, which will be passed to `train_model!`. Besides that, we include `force` which enforces that the model is trained even if it already exists.

First, we should check whether the directory exists ```!isdir(dirname(file_name))``` and if not, we create it ```mkpath(dirname(file_name))```. Then we check whether the file exists (or whether we want to enforce the training). If yes, we train the model, which already modifies ```m```. If not, we ```BSON.load``` the model and copy the loaded parameters into ```m``` by ```Flux.loadparams!(m, params(m_loaded))```. We cannot load directly into ```m``` instead of ```m_loaded``` because that would create a local copy of ```m``` and the function would not modify the external ```m```.
First, we should check whether the directory exists ```!isdir(dirname(file_name))``` and, if not, create it with ```mkpath(dirname(file_name))```. Then we check whether the file is missing (or whether we want to enforce the training). If so, we train the model, which modifies ```m``` in place. Otherwise, we ```BSON.load``` the model and copy the loaded parameters into ```m``` by ```Flux.loadparams!(m, Flux.params(m_loaded))```. We cannot load directly into ```m``` instead of ```m_loaded``` because that would create a local copy of ```m``` and the function would not modify the external ```m```.

```@example nn
function train_or_load!(file_name, m, args...; force=false, kwargs...)
Expand All @@ -509,7 +513,7 @@ function train_or_load!(file_name, m, args...; force=false, kwargs...)
train_model!(m, args...; file_name=file_name, kwargs...)
else
m_weights = BSON.load(file_name)[:m]
Flux.loadparams!(m, params(m_weights))
Flux.loadparams!(m, Flux.params(m_weights))
end
end
