diff --git a/docs/src/lecture_02/arrays.md b/docs/src/lecture_02/arrays.md
index 4e3b0f7a8..4a176591c 100644
--- a/docs/src/lecture_02/arrays.md
+++ b/docs/src/lecture_02/arrays.md
@@ -448,7 +448,7 @@ ERROR: ArgumentError: number of columns of each array must match (got (4, 1))
```
-Create two vectors: vector of all odd positive integers smaller than `10` and vector of all even positive integers smaller than `10`. Then concatenate these two vectors horizontally and fill the third row with `4`.
+Create two vectors: a vector of all odd positive integers smaller than `10` and a vector of all even positive integers smaller than or equal to `10`. Then concatenate these two vectors horizontally and fill the third row with `4`.
```@raw html
diff --git a/docs/src/lecture_11/Iris_train_test_acc.svg b/docs/src/lecture_11/Iris_train_test_acc.svg
new file mode 100644
index 000000000..9f2a0ebbc
--- /dev/null
+++ b/docs/src/lecture_11/Iris_train_test_acc.svg
@@ -0,0 +1,50 @@
+
+
diff --git a/docs/src/lecture_11/iris.md b/docs/src/lecture_11/iris.md
index c2fbb3aa1..56c012b2b 100644
--- a/docs/src/lecture_11/iris.md
+++ b/docs/src/lecture_11/iris.md
@@ -41,8 +41,8 @@ using Flux
n_hidden = 5
m = Chain(
- Dense(size(X_train,1), n_hidden, relu),
- Dense(n_hidden, size(y_train,1), identity),
+ Dense(size(X_train,1) => n_hidden, relu),
+ Dense(n_hidden => size(y_train,1), identity),
softmax,
)
@@ -59,7 +59,7 @@ m(X_train)
Because there are ``3`` classes and ``120`` samples in the training set, it returns an array of size ``3\times 120``. Each column corresponds to one sample and forms a vector of probabilities due to the last layer of softmax.
-We access the neural network parameters by using `params(m)`. We can select the second layer of `m` by `m[2]`. Since the second layer has ``5 `` input and ``3`` output neurons, its parameters are a matrix of size ``3\times 5`` and a vector of length ``3``. The parameters `params(m[2])` are a tuple of the matrix and the vector. This also implies that the parameters are initialized randomly, and we do not need to take care of it. We can easily modify any parameters.
+We access the neural network parameters by using `params(m)`. We can select the second layer of `m` by `m[2]`. Since the second layer has ``5`` input and ``3`` output neurons, its parameters are a matrix of size ``3\times 5`` and a vector of length ``3``. The parameters `params(m[2])` are a tuple of the matrix and the vector. This also shows that the parameters are initialized randomly, so we do not need to take care of it ourselves. We can also easily modify any parameters.
```@example iris
using Flux: params
@@ -76,17 +76,19 @@ To train the network, we need to define the objective function ``L``. Since we a
```@example iris
using Flux: crossentropy
-L(x,y) = crossentropy(m(x), y)
+L(ŷ, y) = crossentropy(ŷ, y)
nothing # hide
```
-The `loss` function does not have `m` as input. Even though there could be an additional input parameter, it is customary to write it without it. We can evaluate the objective function by
+The loss function `L` is defined between the prediction ``\hat{y}`` and the true label ``y``. Therefore, we can evaluate the objective function by
```@example iris
-L(X_train, y_train)
+L(m(X_train), y_train)
```
+where the prediction is `ŷ = m(X_train)`.
+
This computes the objective function on the whole training set. Since Flux is (unlike our implementation from the last lecture) smart, there is no need to take care of individual samples.
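+
+Since the default aggregation in `crossentropy` is the mean over samples, the loss on the whole set equals the average of the per-sample losses. A small sketch of this check (not an executed example):
+```julia
+using Statistics: mean
+
+# average of per-sample losses ≈ loss evaluated on the whole training set
+mean(L(m(X_train[:, i:i]), y_train[:, i:i]) for i in 1:size(X_train, 2))
+```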
!!! info "Notation:"
@@ -95,46 +97,48 @@ This computes the objective function on the whole training set. Since Flux is (u
Since we have the model and the loss function, the only remaining thing is the gradient. Flux again provides a smart way to compute it.
```@example iris
-ps = params(m)
-grad = gradient(() -> L(X_train, y_train), ps)
+grads = Flux.gradient(m -> L(m(X_train), y_train), m)
nothing # hide
```
-The function `gradient` takes two inputs. The first one is the function we want to differentiate, and the second one are the parameters. The `L` function needs to be evaluated at the correct points `X_train` and `y_train`. In some applications, we may need to differentiate with respect to other parameters such as `X_train`. This can be achieved by changing the second parameters of the `gradient` function.
+The function `gradient` takes two kinds of inputs: the function to differentiate and the arguments with respect to which we differentiate. Since the argument here is the model `m` itself, the gradient is taken with respect to all parameters of `m`. The `L` function needs to be evaluated at the correct points: the predictions `m(X_train)` and the true labels `y_train`.
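+
+If we instead need the gradient with respect to the input data, we can pass `X_train` as the differentiated argument; a small sketch (not an executed example):
+```julia
+# gradient of the loss with respect to the input data; it has the same shape as X_train
+grad_X = Flux.gradient(X -> L(m(X), y_train), X_train)
+size(grad_X[1]) == size(X_train)
+```
+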
-```@example iris
-grad = gradient(() -> L(X_train, y_train), params(X_train))
+The `grads` structure is a tuple holding a named tuple with the `:layers` key. Each layer entry then holds the gradients with respect to that layer's parameters, in this case the weight matrix ``W`` and the bias ``b`` (the activation function ``\sigma`` has no trainable parameters, hence `nothing`).
-size(grad[X_train])
+```julia
+julia> grads[1][:layers][2]
+(weight = Float32[0.30140522 0.007785671 … -0.070617765 0.014230583; 0.06814249 -0.07018863 … 0.17996183 -0.20995824; -0.36954764 0.062402964 … -0.10934405 0.19572766], bias = Float32[0.0154182855, 0.022615476, -0.03803377], σ = nothing)
```
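+
+Each gradient has the same shape as the parameter it belongs to; a quick check (a sketch, not an executed example):
+```julia
+# the gradient of the second layer's weight matrix has the same 3×5 shape as the weights
+size(grads[1].layers[2].weight) == size(m[2].weight)
+```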
-Since `X_train` has shape ``4\times 120``, the gradient needs to have the same size.
-
-We train the classifiers for 250 iterations. In each iteration, we compute the gradient with respect to all network parameters and perform the gradient descent with stepsize ``0.1``.
+Now we train the classifier for 250 iterations. In each iteration, we compute the gradient with respect to all network parameters and perform one step of gradient descent with stepsize ``0.1``. Since Flux@0.14, optimisers are defined explicitly rather than implicitly: we first call `Flux.setup(optimiser, model)` to create an optimiser state tied to the model's parameters.
```@example iris
opt = Descent(0.1)
+opt_state = Flux.setup(opt, m)
max_iter = 250
+acc_train = zeros(max_iter)
acc_test = zeros(max_iter)
for i in 1:max_iter
- gs = gradient(() -> L(X_train, y_train), ps)
- Flux.Optimise.update!(opt, ps, gs)
+ gs = Flux.gradient(m -> L(m(X_train), y_train), m)
+ Flux.update!(opt_state, m, gs[1])
+ acc_train[i] = accuracy(X_train, y_train)
acc_test[i] = accuracy(X_test, y_test)
end
nothing # hide
```
-The accuracy on the testing set keeps increasing as the training progresses.
+The accuracy on both the training and the testing set keeps increasing as the training progresses. This is a good sign that we are not over-fitting.
```@example iris
using Plots
-plot(acc_test, xlabel="Iteration", ylabel="Test accuracy", label="", ylim=(-0.01,1.01))
+plot(acc_train, xlabel="Iteration", ylabel="Accuracy", label="train", ylim=(-0.01,1.01))
+plot!(acc_test, label="test")
-savefig("Iris_acc.svg") # hide
+savefig("Iris_train_test_acc.svg") # hide
```
-![](Iris_acc.svg)
+![](Iris_train_test_acc.svg)
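+
+After training, we can also query the final accuracies directly (the exact values depend on the random initialization of the network):
+```julia
+accuracy(X_train, y_train)
+accuracy(X_test, y_test)
+```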
diff --git a/docs/src/lecture_11/nn.md b/docs/src/lecture_11/nn.md
index 6f8e1bd37..72f281820 100644
--- a/docs/src/lecture_11/nn.md
+++ b/docs/src/lecture_11/nn.md
@@ -21,7 +21,7 @@ This lecture shows how to train more complex networks using stochastic gradient
## Preparing data
-During the last lecture, we implemented everything from scratch. This lecture will introduce the package [Flux](https://fluxml.ai/Flux.jl/stable/models/basics/) which automizes most of the things needed for neural networks.
+During the last lecture, we implemented everything from scratch. This lecture will introduce the package [Flux](https://fluxml.ai/Flux.jl/stable/models/basics/) (and [Optimisers](https://fluxml.ai/Optimisers.jl/stable/)), which automates most of the things needed for neural networks.
- It creates many layers, including convolutional layers.
- It creates the model by chaining layers together.
- It efficiently represents model parameters.
@@ -381,21 +381,24 @@ m = Chain(
nothing # hide
```
-The objective function ``L`` then applies the cross-entropy loss to the predictions and labels.
+The objective function ``L`` then applies the cross-entropy loss to the predictions and labels. To be able to use the `Flux.train!` function to train the neural network easily, we will define the loss ``L`` with the model as its first argument:
```@example nn
using Flux: crossentropy
-L(X, y) = crossentropy(m(X), y)
+L(model, X, y) = crossentropy(model(X), y)
nothing # hide
```
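+
+With this signature, a call like `Flux.train!(L, model, batches, opt_state)` evaluates `L(model, x, y)` for every minibatch `(x, y)` drawn from `batches` and updates the model through the optimiser state. Roughly, one epoch corresponds to the following simplified sketch (with `model`, `batches`, and `opt_state` standing for the corresponding arguments; this is not the actual implementation of `train!`):
+```julia
+# simplified sketch of one epoch of Flux.train!
+for (x, y) in batches
+    gs = Flux.gradient(m -> L(m, x, y), model)
+    Flux.update!(opt_state, model, gs[1])
+end
+```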
We now write the function `train_model!` to train the neural network `m`. Since this function modifies the input model `m`, its name should contain the exclamation mark. Besides the loss function `L`, data `X` and labels `y`, it also takes as keyword arguments the optimizer `opt`, the minibatch size `batchsize`, the number of epochs `n_epochs`, and the file name `file_name` to which the model should be saved.
+!!! info "Optimiser and optimiser state:"
+    Note that we have to initialize the optimiser state `opt_state`. For plain gradient descent `Descent(learning_rate)`, the optimiser has no internal state beyond the learning rate. However, for parametrized optimisers such as Adam, the internal state stored in `opt_state` is updated in each iteration, just like the parameters of the model. Therefore, if we want to save a model and continue its training later, we need to save both the model (or its parameters) and the optimiser state.
+
+
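+For example, with a parametrized optimiser both the model and the optimiser state can be checkpointed together; a minimal sketch (the file name `checkpoint.bson` is only illustrative):
+```julia
+using Flux, BSON
+
+opt_state = Flux.setup(Adam(0.01), m)  # opt_state mirrors the structure of m and stores Adam's moment estimates
+# ... training ...
+BSON.bson("checkpoint.bson", m=m, opt_state=opt_state)
+```
+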
```@example nn
using BSON
-using Flux: params
function train_model!(m, L, X, y;
opt = Descent(0.1),
@@ -403,13 +406,14 @@ function train_model!(m, L, X, y;
n_epochs = 10,
file_name = "")
+ opt_state = Flux.setup(opt, m)
batches = DataLoader((X, y); batchsize, shuffle = true)
for _ in 1:n_epochs
- Flux.train!(L, params(m), batches, opt)
+ Flux.train!(L, m, batches, opt_state)
end
- !isempty(file_name) && BSON.bson(file_name, m=m)
+ !isempty(file_name) && BSON.bson(file_name, m=m, opt_state=opt_state)
return
end
@@ -498,7 +502,7 @@ Use this function to load the model from `data/mnist.bson` and evaluate the perf
The optional arguments should contain `kwargs...`, which will be passed to `train_model!`. Besides that, we include `force` which enforces that the model is trained even if it already exists.
-First, we should check whether the directory exists ```!isdir(dirname(file_name))``` and if not, we create it ```mkpath(dirname(file_name))```. Then we check whether the file exists (or whether we want to enforce the training). If yes, we train the model, which already modifies ```m```. If not, we ```BSON.load``` the model and copy the loaded parameters into ```m``` by ```Flux.loadparams!(m, params(m_loaded))```. We cannot load directly into ```m``` instead of ```m_loaded``` because that would create a local copy of ```m``` and the function would not modify the external ```m```.
+First, we should check whether the directory exists ```!isdir(dirname(file_name))``` and if not, we create it ```mkpath(dirname(file_name))```. Then we check whether the file exists (or whether we want to enforce the training). If yes, we train the model, which already modifies ```m```. If not, we ```BSON.load``` the model and copy the loaded parameters into ```m``` by ```Flux.loadparams!(m, Flux.params(m_weights))```. We cannot load directly into ```m``` instead of ```m_weights``` because that would create a local copy of ```m``` and the function would not modify the external ```m```.
```@example nn
function train_or_load!(file_name, m, args...; force=false, kwargs...)
@@ -509,7 +513,7 @@ function train_or_load!(file_name, m, args...; force=false, kwargs...)
train_model!(m, args...; file_name=file_name, kwargs...)
else
m_weights = BSON.load(file_name)[:m]
- Flux.loadparams!(m, params(m_weights))
+ Flux.loadparams!(m, Flux.params(m_weights))
end
end