Create performance tips docs section #615

Merged (14 commits) Feb 19, 2019
1 change: 1 addition & 0 deletions docs/make.jl
@@ -19,6 +19,7 @@ makedocs(modules=[Flux, NNlib],
"One-Hot Encoding" => "data/onehot.md",
"GPU Support" => "gpu.md",
"Saving & Loading" => "saving.md",
"Performance Tips" => "performance.md",
"Internals" =>
["Backpropagation" => "internals/tracker.md"],
"Community" => "community.md"])
76 changes: 76 additions & 0 deletions docs/src/performance.md
@@ -0,0 +1,76 @@
# Performance Tips

All the usual [Julia performance tips](https://docs.julialang.org/en/v1/manual/performance-tips/) apply.
As always, [profiling your code](https://docs.julialang.org/en/v1/manual/profile/#Profiling-1) is a useful way of finding bottlenecks.
Below follow some Flux-specific tips and reminders.

## Don't use more precision than you need

Flux works great with all kinds of number types.
But often you do not need to be working with say `Float64` (let alone `BigFloat`).
Switching to `Float32` can give you a significant speed up:
memory usage is halved, which means allocations happen much faster,
and the operations themselves are often faster too, since more numbers fit into a SIMD lane at a given width.
This matters even more on the GPU, where `Float32` is generally much faster than `Float64`.
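As a minimal sketch of what this can look like (assuming your raw data arrives as `Float64` arrays; the names here are only illustrative):

```julia
# Hypothetical Float64 feature and target arrays
x = rand(100, 1_000)
y = rand(10, 1_000)

# Convert once, up front, so everything downstream works in Float32
x32 = Float32.(x)
y32 = Float32.(y)
```

Depending on how your model was built, its parameters may also need converting so that weights and data share the same element type.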


## Make sure your custom activation functions preserve the type of their inputs

Not only should your activation functions be [type-stable](https://docs.julialang.org/en/v1/manual/performance-tips/#Write-%22type-stable%22-functions-1),
they should also preserve the type of their inputs.

A very artificial example using an activation function like

```julia
my_tanh(x) = Float64(tanh(x))
```

will result in performance on `Float32` input orders of magnitude slower than the normal `tanh` would,
because it results in having to use slow mixed-type multiplication in the dense layers.

Which means that if you change your data, say from `Float64` to `Float32` (which should give a speedup: see above),
you will see a large slow-down.
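One quick way to spot this kind of promotion is to check the element type of the output; a rough sketch, using `my_tanh` as defined above:

```julia
x32 = rand(Float32, 10)

eltype(tanh.(x32))     # Float32: the input type is preserved
eltype(my_tanh.(x32))  # Float64: the Float32 input has been promoted
```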

This can occur sneakily, because you can cause type-promotion by interacting with numeric literals.
E.g. the following will run into the same problem as above:

```julia
leaky_tanh(x) = 0.01x + tanh(x)
```

While one could change the activation function (e.g. to use `0.01f0 * x`) to avoid this whenever your inputs change,
the idiomatic (and safe) way is to use `oftype`:

```julia
leaky_tanh(x) = oftype(x/1, 0.01)*x + tanh(x)
```
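With this version the literal takes on the element type of the input, so `Float32` stays `Float32` end to end; a quick check, under the same assumptions as above:

```julia
x32 = rand(Float32, 10)
eltype(leaky_tanh.(x32))  # Float32, no promotion to Float64
```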


## Evaluate batches as Matrices of features, rather than sequences of Vector features

While it can sometimes be tempting to process your observations (feature vectors) one at a time,
e.g.
```julia
function loss_total(xs::AbstractVector{<:Vector}, ys::AbstractVector{<:Vector})
    sum(zip(xs, ys)) do (x, y_target)
        y_pred = model(x)  # evaluate the model
        return loss(y_pred, y_target)
    end
end
```

It is much faster to concatenate them into a matrix,
as this will hit BLAS matrix-matrix multiplication, which is much faster than the equivalent sequence of matrix-vector multiplications,
even though this means allocating new memory to store them contiguously.

```julia
x_batch = reduce(hcat, xs)  # one column per observation
y_batch = reduce(hcat, ys)
...
function loss_total(x_batch::Matrix, y_batch::Matrix)
    y_preds = model(x_batch)
    sum(loss.(y_preds, y_batch))
end
```

When doing this kind of concatenation, use `reduce(hcat, xs)` rather than `hcat(xs...)`.
This will avoid the splatting penalty, and will hit the optimised `reduce` method.
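As a rough illustration with hypothetical data (exact timings depend on your machine and on how many observations you have):

```julia
xs = [rand(Float32, 100) for _ in 1:10_000]  # hypothetical list of feature vectors

x_batch = reduce(hcat, xs)   # hits the specialised reduce method
# x_batch = hcat(xs...)      # same result, but splatting 10_000 arguments is slow
```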