add documentation for image layouts #1705

**Open** · wants to merge 3 commits into base: `master`
4 changes: 3 additions & 1 deletion docs/make.jl
@@ -17,7 +17,9 @@ makedocs(modules = [Flux, NNlib, Functors],
"Functors" => "models/functors.md"],
"Handling Data" =>
["One-Hot Encoding" => "data/onehot.md",
"DataLoader" => "data/dataloader.md"],
"DataLoader" => "data/dataloader.md",
"Image layouts" => "data/image_layouts.md",
],
"Training Models" =>
["Optimisers" => "training/optimisers.md",
"Training" => "training/training.md"],
234 changes: 234 additions & 0 deletions docs/src/data/image_layouts.md
@@ -0,0 +1,234 @@
# Handling different image layouts

!!! tip
    The content here is specific to images, but similar reasoning applies to other kinds of data.

When using Flux for image processing and computer vision tasks, there are a few different
data layouts you may encounter before passing your data to a network. For example, to
represent 64 two-dimensional RGB images of size 32×32, you could store them in any of the
following formats:

* Vector of arrays, where each array is an array of structs: `Vector{Matrix{RGB{Float32}}}`
* Vector of plain arrays: `Vector{Array{Float32, 3}}`
* Array of structs: `Array{RGB{Float32}, 3}`
* Plain array: `Array{Float32, 4}`

Even within the plain-array case, there are various memory orders depending on where we put the
channel (C), height (H), width (W), and batch (N) dimensions. Among all the possible combinations,
these are the ones you are most likely to meet:

* WHCN: the Flux standard, because Julia is column-major.
* NCHW: the PyTorch standard, because PyTorch uses row-major memory by default.
* CHWN: the order we get if we unroll the data of a `Vector{Matrix{RGB{Float32}}}` without touching
  its memory.

Depending on which ecosystem you are working with, all of these layouts and memory orders can be
useful. This page walks you through the transformations between these data layouts. We will
also clear up some of the confusion you might run into, or in other words, explain why we need all these
different layouts and why we can't just enforce one of them as a "standard."

We start with a quick lookup table so you can easily find what you need. Then we
focus on the reasoning behind these layouts and their concrete usages in different ecosystems, so you
know how to choose the most appropriate layout for your specific application and scenario.

## The quick lookup table

It's easier and more precise to describe layouts with types than with long phrases like "vector of
plain arrays", so the table below uses 2D RGB images as an example; for other cases you will need to
adjust things a bit.

The main tools at hand are:

* `colorview`/`channelview` to split and combine image channels. They require [`ImageCore`](https://github.com/JuliaImages/ImageCore.jl) (or [`Images`](https://github.com/JuliaImages/Images.jl)).
* `permutedims` to change dimension orders; `PermutedDimsArray` is its lazy version.
* `Flux.stack`/`Flux.unstack` to combine multiple small arrays into one big array or vice
  versa. [`StackViews.StackView`](https://github.com/JuliaArrays/StackViews.jl) is a lazy version of `Flux.stack`.

To reduce unnecessary memory allocations and get better transformation performance, we use the lazy
versions in the table. Not every entry in this table is equally useful, but they are all listed as
practice so that you can get a better idea of how things work.
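
Before the table itself, here is a minimal sketch of how these tools compose, going from a vector of
AOS images to Flux's WHCN layout. The batch of four 8×8 images is an arbitrary choice for
illustration:

```julia
using ImageCore, StackViews

# a toy "dataset": 4 tiny 8×8 RGB images stored as a vector of AOS matrices
imgs = [rand(RGB{Float32}, 8, 8) for _ in 1:4]            # Vector{Matrix{RGB{Float32}}}

batch_hwn  = StackView(imgs)                              # 8×8×4 of RGB, lazily stacked
batch_chwn = channelview(batch_hwn)                       # 3×8×8×4, channels split without copying
batch_whcn = PermutedDimsArray(batch_chwn, (3, 2, 1, 4))  # 8×8×3×4, Flux's WHCN

# materialize a dense Array before handing it to a model (or a GPU)
x = collect(batch_whcn)
size(x) == (8, 8, 3, 4)   # true
```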

| From (order) | To (order) | Transformation
| ------------------------ | ----------------------- | -----------------------------------
| `Vector{Matrix{RGB{T}}}`(HW) | `Vector{Array{T,3}}`(CHW) | `channelview.(X)`
| `Vector{Matrix{RGB{T}}}`(HW) | `Vector{Array{T,3}}`(HWC) | `map(x->PermutedDimsArray(channelview(x), (2, 3, 1)), X)`
| `Vector{Matrix{RGB{T}}}`(HW) | `Array{RGB{T},3}`(HWN) | `StackView(X)`
| `Vector{Matrix{RGB{T}}}`(HW) | `Array{RGB{T},3}`(NHW) | `StackView(X, Val(1))`
| `Vector{Matrix{RGB{T}}}`(HW) | `Array{T,4}`(CHWN) | `channelview(StackView(X))`
| `Vector{Matrix{RGB{T}}}`(HW) | `Array{T,4}`(HWCN) | `PermutedDimsArray(channelview(StackView(X)), (2, 3, 1, 4))`
| `Vector{Array{T,3}}`(CHW) | `Vector{Matrix{RGB{T}}}`(HW) | `colorview.(RGB, X)`
| `Vector{Array{T,3}}`(CHW) | `Vector{Array{T,3}}`(HWC) | `map(x -> PermutedDimsArray(x, (2, 3, 1)), X)`
| `Vector{Array{T,3}}`(CHW) | `Array{RGB{T},3}`(HWN) | `StackView(colorview.(RGB, X))`
| `Vector{Array{T,3}}`(CHW) | `Array{RGB{T},3}`(NHW) | `StackView(colorview.(RGB, X), Val(1))`
| `Vector{Array{T,3}}`(CHW) | `Array{T,4}`(CHWN) | `StackView(X)`
| `Vector{Array{T,3}}`(CHW) | `Array{T,4}`(HWCN) | `PermutedDimsArray(StackView(X), (2, 3, 1, 4))`
| `Array{RGB{T},3}`(HWN) | `Vector{Matrix{RGB{T}}}`(HW) | `Flux.unstack(X, 3)` (eager)
| `Array{RGB{T},3}`(HWN) | `Vector{Array{T,3}}`(CHW) | `Flux.unstack(channelview(X), 4)` (eager)
| `Array{RGB{T},3}`(HWN) | `Vector{Array{T,3}}`(HWC) | `Flux.unstack(PermutedDimsArray(channelview(X), (2, 3, 1, 4)), 4)` (eager)
| `Array{RGB{T},3}`(HWN) | `Array{T,4}`(CHWN) | `channelview(X)`
| `Array{RGB{T},3}`(HWN) | `Array{T,4}`(HWCN) | `PermutedDimsArray(channelview(X), (2, 3, 1, 4))`
| `Array{T,4}`(CHWN) | `Array{T,4}`(HWCN) | `PermutedDimsArray(X, (2, 3, 1, 4))`
| `Array{T,4}`(CHWN) | `Array{RGB{T},3}`(HWN) | `colorview(RGB, X)`
| `Array{T,4}`(CHWN) | `Vector{Array{T,3}}`(CHW) | `Flux.unstack(X, 4)` (eager)
| `Array{T,4}`(CHWN) | `Vector{Matrix{RGB{T}}}`(HW) | `Flux.unstack(colorview(RGB, X), 3)` (eager)
| `Array{T,4}`(HWCN) | `Array{T,4}`(CHWN) | `PermutedDimsArray(X, (3, 1, 2, 4))`
| `Array{T,4}`(HWCN) | `Array{RGB{T},3}`(HWN) | `colorview(RGB, PermutedDimsArray(X, (3, 1, 2, 4)))`
| `Array{T,4}`(HWCN) | `Vector{Array{T,3}}`(CHW) | `Flux.unstack(PermutedDimsArray(X, (3, 1, 2, 4)), 4)` (eager)
| `Array{T,4}`(HWCN) | `Vector{Matrix{RGB{T}}}`(HW) | `Flux.unstack(colorview(RGB, PermutedDimsArray(X, (3, 1, 2, 4))), 3)` (eager)

Additional notes on this table:

* The transformations listed here are by no means the only possible ones. The Julia
  ecosystem has very generic array support, so you're invited to try other packages and
  see if they fit your usage better.
* Height (H) and width (W) are often used interchangeably, so we don't differentiate them
  here. If the distinction matters to you, adjust the `permutedims`/`PermutedDimsArray`
  arguments accordingly.
* If you're processing GPU arrays, where scalar `getindex` is extremely slow, you should
  probably use the eager vector versions or insert some `collect` calls (see the sketch after
  this list). A simple strategy is to do all the permutations on the CPU before uploading
  the data to the GPU.
* For performance reasons, the code here doesn't necessarily return the exact type declared in
  the table. For instance, `channelview` outputs a `ReinterpretArray` rather than an `Array`. In most
  cases this doesn't matter, but when it does, calling `collect` or similar functions will
  give you a dense representation.
* Methods in this table might not be AD-ready.
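
As a concrete illustration of the GPU note, a common pattern is to build the lazy pipeline on the
CPU, `collect` it into a dense array, and only then move it to the GPU. This is a sketch assuming a
CUDA-capable setup; Flux's `gpu` falls back to a no-op when no GPU is available:

```julia
using ImageCore, StackViews, Flux

imgs = [rand(RGB{Float32}, 8, 8) for _ in 1:4]   # Vector{Matrix{RGB{Float32}}}

# lazy CPU-side transformation into WHCN ...
x_lazy = PermutedDimsArray(channelview(StackView(imgs)), (3, 2, 1, 4))

# ... one eager copy, then only the dense array is moved to the GPU
x = collect(x_lazy) |> gpu
```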
> **Review comment (Member):** Ideally, we would do a test of this and mention which ones are not AD-ready in the table.

> **Reply (Contributor, author):** I've thought about this too, and I've also considered having the code here tested with Documenter's doctests, but it seems the Flux documentation doesn't currently do it that way.

[deps]
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
Functors = "d9f16b24-f501-4c13-a1f2-28368ffc5196"
NNlib = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"

> **@darsnack (Member), Sep 5, 2021:** I was thinking this would be a one-time manual step where one of us just runs through all the code in the table and makes note of the ones that error. Then we add a parenthetical for the ones that break AD.
>
> As far as doctests for this page, there is no issue adding Images.jl to the docs Project.toml. None of the code here tests any of Flux's internals (any more than what's already tested), but it could be valuable to make sure we update this page in case anything goes out of date.


## Why isn't there a single function that just works for every layout transformation?

It is theoretically possible, but the whole ecosystem (Flux, JuliaImages, CUDA.jl and others) is not
yet ready for it. The main reason is that in Julia we want to write generic code that works for all
the different use cases. Hardcoding the meaning of a plain numerical array `Array{<:Number, 4}` to any
particular format, say WHCN in Flux, would introduce unavoidable ambiguities. To make this actually
work in practice, we need richer type and dimension information to dispatch on (e.g., `RGB`), and that
requires a lot of foundational work from the different Julia communities.
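
To make the ambiguity point concrete, here is a small sketch; `nchannels` is a hypothetical helper
written only for illustration, not part of any package:

```julia
using ImageCore

# With a plain numerical array, the meaning of each dimension is only a convention:
x = rand(Float32, 32, 32, 3)      # HWC? WHC? or three grayscale frames?

# With a colorant element type, the channel information lives in the type itself,
# so generic code can dispatch on it:
img = rand(RGB{Float32}, 32, 32)  # unambiguously a 32×32 RGB image

nchannels(::AbstractArray{<:Number}) = 1
nchannels(::AbstractArray{<:Colorant{T,N}}) where {T,N} = N

nchannels(x), nchannels(img)      # (1, 3)
```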

So why does Flux use the WHCN format at present? Mainly because Flux relies on C/C++
libraries (e.g., NVIDIA cuDNN) for the actual GPU computation, and those libraries hardcode the NCHW
format. Since C/C++ is row-major while Julia is column-major, a WHCN array in Julia has exactly the
memory layout these backends expect; we explain this further below. Once Flux is smart enough to
unroll the data into a raw numerical array and feed it to the different backends itself, things could
just work for any generic array layout.
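
A quick way to see why WHCN is the column-major counterpart of NCHW is to look at the strides of a
WHCN array (a minimal sketch; the sizes are arbitrary):

```julia
# 64 RGB images of size 32×32 in Flux's WHCN layout
x = rand(Float32, 32, 32, 3, 64)

# Julia is column-major, so the first dimension varies fastest in memory:
strides(x)   # (1, 32, 1024, 3072)

# From fastest- to slowest-varying the dimensions are W, H, C, N, which is exactly
# the memory order a row-major NCHW library such as cuDNN expects.
```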

## What data layout should I use?

Let's start with a statement: there is no single best data layout that suits every use case; whenever
you choose a layout, you are making a tradeoff between usability and performance. For
better usability we write code at the level of individual elements, and for better performance we
emphasize data locality for the specific computation pattern.

### Represent an image

The first layout choice you might notice is array of structs (AOS) versus struct of arrays (SOA).
We need to differentiate them because they are stored quite differently in memory: the flat
memory representation of AOS is `RGBRGB...RGB`, while that of SOA is
`RR...RGG...GBB...B`.

When you store an image as `Array{<:RGB}`, you are storing it in AOS layout:

```julia
using ImageCore
using TestImages
using BenchmarkTools

rgb = [RGB(0.1, 0.2, 0.3), RGB(0.4, 0.5, 0.6), RGB(0.7, 0.8, 0.9)] # Vector{RGB{Float64}}

# channelview opens the colorant struct without manipulating its memory
channelview(rgb)[:] == [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] # true
```

and when you store an RGB image as an `Array{Float32, 3}` (HWC), you are effectively storing it in the SOA layout:

```julia
# the same three pixels: one row per pixel, one column per channel
rgb_soa = [0.1 0.2 0.3; 0.4 0.5 0.6; 0.7 0.8 0.9]
red_channel = rgb_soa[:, 1]  # [0.1, 0.4, 0.7]
```

Due to this difference in memory layout, the first and most important
consequence is: the AOS layout works best when the underlying function is defined per pixel, because it
benefits from data locality (e.g., the CPU cache).

```julia
function colorjitter(img)
    bias = 0.1rand(eltype(img))
    return @. clamp01(img + bias)
end

function colorjitter_cv(img)
    out = similar(img)
    @inbounds for i in axes(img, 3)
        bias = 0.1rand(eltype(img))
        channel = @view img[:, :, i]
        @. out[:, :, i] = clamp01(channel + bias)
    end
    return out
end

# array of structs layout
img = testimage("lighthouse"); # size (512, 768)
@btime colorjitter($img); # 1.035 ms (2 allocations: 9.00 MiB)

# struct of arrays layout
img = permutedims(channelview(testimage("lighthouse")), (2, 3, 1)) # size (512, 768, 3)
@btime colorjitter_cv($img); # 2.215 ms (2 allocations: 1.13 MiB)
```

Most kernel operations in JuliaImages are defined per pixel, so the AOS layout gives better
performance there. For the same reason, the SOA layout works best when the underlying function is
defined per channel. Deep learning uses convolutional neural networks (CNNs) for image processing and
computer vision tasks, and a convolution is a set of filter operations defined per channel, so CNN
backends (e.g., cuDNN) use the SOA layout to squeeze every last drop of performance out of the GPU.
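
For instance, per-channel operations such as channel-wise normalization map naturally onto the SOA
layout, since each channel occupies one contiguous block of memory. A minimal sketch on random data:

```julia
using Statistics

# an "image" in SOA (HWC) layout: each 512×768 channel is one contiguous block
img_hwc = rand(Float32, 512, 768, 3)

μ = mean(img_hwc; dims = (1, 2))   # 1×1×3, one value per channel
σ = std(img_hwc; dims = (1, 2))    # 1×1×3
normalized = (img_hwc .- μ) ./ σ   # per-channel standardization
```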

!!! note
    Modern hardware and compilers are sophisticated enough that this SOA-AOS performance gap
    may not be as large as you might expect, given enough engineering effort.

Besides performance, there are two other good reasons why JuliaImages uses the
array-of-structs layout (see the sketch after this list):

* Julia's broadcasting is powerful, and you benefit most from its simplicity if you commit to
  the "every element of the array is a computational scalar" assumption; defining colorant structs
  such as `RGB` and storing an RGB image as `Array{RGB}` is a natural step towards this.
* JuliaImages is designed to process generic images (not just 2D RGB images), and using a plain
  `Array{Float32, 3}` to store the data would be ambiguous (is the third dimension a color channel
  or a third spatial dimension?).
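
As a small illustration of the first point, broadcasting over an `Array{RGB}` treats each pixel as a
single scalar-like value (a minimal sketch, assuming `ImageCore` for the color arithmetic):

```julia
using ImageCore

img = rand(RGB{Float32}, 4, 4)

# each RGB pixel behaves as one computational "scalar" under broadcasting
brighter  = clamp01.(img .+ RGB(0.1f0, 0.1f0, 0.1f0))
grayscale = Gray.(img)   # elementwise color conversion in a single broadcast
```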

### Represent a batch

`Vector{Array{Float32, 3}}` (WHC) and `Array{Float32, 4}` (WHCN) are two typical ways to store RGB
images in the SOA layout. What are the benefits of one over the other, given that the per-image
memory layout is the same?

In Julia there isn't much difference, because we have good support for higher-order functions.
Assuming we have a _performant_ function `fn` that processes a single image `Array{Float32, 3}`, it is
clear that storing the data as a `Vector{Array}` makes things easier to write without losing performance:

```julia
# If it is Vector{Array{Float32, 3}}
map(fn, batch) # much simpler!

# If it is Array{Float32, 4}
out = similar(batch)
for i in axes(batch, 4)
    out[:, :, :, i] .= fn(@view(batch[:, :, :, i]))
end
```

But in Python things are quite different, because both a for-loop and `map` introduce
significant overhead due to the dynamic nature of the language. The solution there is to vectorize[1] the
computation by providing a bigger function `fn_vec` that handles a whole batch of images instead of a
single image, so that `fn_vec(batch)` just works. Any network layer in Flux can be seen as such an
`fn_vec`. `Array{Float32, 4}` is what the C/C++ backends consume because it has the simplest memory
structure: one big memory block of plain old data.
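
To make the `fn_vec` analogy concrete, here is a minimal sketch of a Flux layer consuming an entire
WHCN batch in one call (the layer and sizes are arbitrary):

```julia
using Flux

# a batch of 64 RGB images of size 32×32 in WHCN layout
batch = rand(Float32, 32, 32, 3, 64)

# every Flux layer is already "vectorized": it maps a whole batch to a whole batch
layer = Conv((3, 3), 3 => 16, relu)
size(layer(batch))   # (30, 30, 16, 64)
```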

The conclusion: if we could build everything in pure Julia, the `Vector{Array{Float32, 3}}`
layout would be the simplest solution. But the reality is that we don't have a Julia version of cuDNN
with competitive performance, so for quite a while yet we will have to live with the ambiguous
`Array{Float32, 4}` layout.

!!! warning "Disclaimer"
    There are many tricks in CUDA kernel programming to reduce data-transfer wait times
    and to increase concurrency. These techniques can affect your choice of data format
    (e.g., whether to use `Vector{Array{Float32, 3}}` or `Array{Float32, 4}`).
    This tutorial does not cover them, so the recommendations above may change when
    working with a GPU.

## References

* [1] For broadcasting and vectorization, please read [More Dots: Syntactic Loop Fusion in
Julia](https://julialang.org/blog/2017/01/moredots/)