add documentation for image layouts #1705
Open
johnnychen94 wants to merge 3 commits into FluxML:master from johnnychen94:jc/image_layouts
+237
−1
@@ -0,0 +1,234 @@
# Handling different image layouts

!!! tip
    The contents here are specific to images, but similar reasoning can be applied to other data.

When using Flux for image processing and computer vision tasks, there are a few different
data layouts you'll encounter before passing your data into a network. For example, if you want to
represent 64 two-dimensional RGB images of size 32x32, you could store them in any of the
following formats (see the sketch after this list):

* Vector of arrays of structs: `Vector{Matrix{RGB{Float32}}}`
* Vector of plain arrays: `Vector{Array{Float32, 3}}`
* Array of structs: `Array{RGB{Float32}, 3}`
* Plain array: `Array{Float32, 4}`

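To make the four formats concrete, here is a minimal sketch (hypothetical random data; `ImageCore` provides `RGB`) that builds the same batch of 64 random 32x32 RGB images in each of them:

```julia
using ImageCore  # provides RGB, channelview, colorview

# Hypothetical batch: 64 random 32×32 RGB images, once in each of the four formats above.
imgs_aos  = [rand(RGB{Float32}, 32, 32) for _ in 1:64]  # Vector{Matrix{RGB{Float32}}}
imgs_soa  = [rand(Float32, 3, 32, 32) for _ in 1:64]    # Vector{Array{Float32, 3}} (CHW)
batch_aos = rand(RGB{Float32}, 32, 32, 64)              # Array{RGB{Float32}, 3} (HWN)
batch_soa = rand(Float32, 32, 32, 3, 64)                # Array{Float32, 4} (e.g. WHCN)
```
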
Even within the plain array case, there are various memory orders depending on where we put the
channel (C), height (H), width (W), and batch (N) dimensions. Among all the possible combinations,
these are the ones you will meet most often:

* WHCN: the Flux standard, because Julia arrays are column-major.
* NCHW: the PyTorch standard, because it uses row-major order by default.
* CHWN: the order you get if you unroll the data of a `Vector{Matrix{RGB{Float32}}}` without
  rearranging memory (see the short check after this list).

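As a quick check of that last claim (hypothetical tiny images; `StackViews` provides `StackView`), stacking the images and opening the `RGB` structs, both lazily, exposes the raw memory in C-H-W-N order:

```julia
using ImageCore, StackViews

# Two hypothetical 2×2 RGB images stored as Vector{Matrix{RGB{Float32}}}.
imgs = [rand(RGB{Float32}, 2, 2) for _ in 1:2]

# Stack lazily along a new last dimension, then open the RGB structs lazily:
raw = channelview(StackView(imgs))
size(raw)  # (3, 2, 2, 2) == (C, H, W, N), and no memory was moved to get here
```
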
Depending on which ecosystem you are working with, all of these layouts and memory orders are
useful. This page walks you through the transformations between these data layouts. We will
also clear up some of the confusion you might run into, or in other words, explain why we need all these
different layouts and why we can't just enforce one as a "standard."

We will start with a quick lookup table so you can easily find what you need. Then we
focus on the reasoning behind these layouts and their concrete usages in different ecosystems, so you
will know how to choose the most appropriate layout for your specific application.

## The quick lookup table

It's easier and more precise to use types rather than long descriptions like "vector of plain arrays",
so the table below uses 2D RGB images as an example; for other cases you will need to adjust it a
bit.

The main tools at hand are (illustrated briefly after this list):

* `colorview`/`channelview` to split and combine image channels. These require [`ImageCore`](https://github.com/JuliaImages/ImageCore.jl) (or [`Images`](https://github.com/JuliaImages/Images.jl)).
* `permutedims` to change dimension order. `PermutedDimsArray` is its lazy version.
* `Flux.stack`/`Flux.unstack` to combine multiple small arrays into one big array, or vice
  versa. [`StackViews.StackView`](https://github.com/JuliaArrays/StackViews.jl) is the lazy version of `Flux.stack`.

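A minimal sanity check of these tools on hypothetical tiny arrays, using the positional `Flux.stack(xs, dim)`/`Flux.unstack(xs, dim)` forms the table below assumes:

```julia
using ImageCore
import Flux

img = rand(RGB{Float32}, 2, 2)                 # a hypothetical 2×2 RGB image
colorview(RGB, channelview(img)) == img        # true: the two views are inverses

xs = [rand(Float32, 2, 2, 3) for _ in 1:4]     # four hypothetical HWC images
batch = Flux.stack(xs, 4)                      # a 2×2×3×4 array (HWCN)
Flux.unstack(batch, 4) == xs                   # true: unstack undoes stack
```
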
To reduce unnecessary memory allocations and get better transformation performance, we use the lazy
versions in the table. Not every entry in this table is equally useful, but they are all listed as
mental practice so that you can get a better idea of how things work.

| From (order) | To (order) | Transformation |
| ------------------------ | ----------------------- | ----------------------------------- |
| `Vector{Matrix{RGB{T}}}` (HW) | `Vector{Array{T,3}}` (CHW) | `channelview.(X)` |
| `Vector{Matrix{RGB{T}}}` (HW) | `Vector{Array{T,3}}` (HWC) | `map(x->PermutedDimsArray(channelview(x), (2, 3, 1)), X)` |
| `Vector{Matrix{RGB{T}}}` (HW) | `Array{RGB{T},3}` (HWN) | `StackView(X)` |
| `Vector{Matrix{RGB{T}}}` (HW) | `Array{RGB{T},3}` (NHW) | `StackView(X, Val(1))` |
| `Vector{Matrix{RGB{T}}}` (HW) | `Array{T,4}` (CHWN) | `channelview(StackView(X))` |
| `Vector{Matrix{RGB{T}}}` (HW) | `Array{T,4}` (HWCN) | `PermutedDimsArray(channelview(StackView(X)), (2, 3, 1, 4))` |
| `Vector{Array{T,3}}` (CHW) | `Vector{Matrix{RGB{T}}}` (HW) | `colorview.(RGB, X)` |
| `Vector{Array{T,3}}` (CHW) | `Vector{Array{T,3}}` (HWC) | `map(x->PermutedDimsArray(x, (2, 3, 1)), X)` |
| `Vector{Array{T,3}}` (CHW) | `Array{RGB{T},3}` (HWN) | `StackView(colorview.(RGB, X))` |
| `Vector{Array{T,3}}` (CHW) | `Array{RGB{T},3}` (NHW) | `StackView(colorview.(RGB, X), Val(1))` |
| `Vector{Array{T,3}}` (CHW) | `Array{T,4}` (CHWN) | `StackView(X)` |
| `Vector{Array{T,3}}` (CHW) | `Array{T,4}` (HWCN) | `PermutedDimsArray(StackView(X), (2, 3, 1, 4))` |
| `Array{RGB{T},3}` (HWN) | `Vector{Matrix{RGB{T}}}` (HW) | `Flux.unstack(X, 3)` (eager) |
| `Array{RGB{T},3}` (HWN) | `Vector{Array{T,3}}` (CHW) | `Flux.unstack(channelview(X), 4)` (eager) |
| `Array{RGB{T},3}` (HWN) | `Vector{Array{T,3}}` (HWC) | `Flux.unstack(PermutedDimsArray(channelview(X), (2, 3, 1, 4)), 4)` (eager) |
| `Array{RGB{T},3}` (HWN) | `Array{T,4}` (CHWN) | `channelview(X)` |
| `Array{RGB{T},3}` (HWN) | `Array{T,4}` (HWCN) | `PermutedDimsArray(channelview(X), (2, 3, 1, 4))` |
| `Array{T,4}` (CHWN) | `Array{T,4}` (HWCN) | `PermutedDimsArray(X, (2, 3, 1, 4))` |
| `Array{T,4}` (CHWN) | `Array{RGB{T},3}` (HWN) | `colorview(RGB, X)` |
| `Array{T,4}` (CHWN) | `Vector{Array{T,3}}` (CHW) | `Flux.unstack(X, 4)` (eager) |
| `Array{T,4}` (CHWN) | `Vector{Matrix{RGB{T}}}` (HW) | `Flux.unstack(colorview(RGB, X), 3)` (eager) |
| `Array{T,4}` (HWCN) | `Array{T,4}` (CHWN) | `PermutedDimsArray(X, (3, 1, 2, 4))` |
| `Array{T,4}` (HWCN) | `Array{RGB{T},3}` (HWN) | `colorview(RGB, PermutedDimsArray(X, (3, 1, 2, 4)))` |
| `Array{T,4}` (HWCN) | `Vector{Array{T,3}}` (CHW) | `Flux.unstack(PermutedDimsArray(X, (3, 1, 2, 4)), 4)` (eager) |
| `Array{T,4}` (HWCN) | `Vector{Matrix{RGB{T}}}` (HW) | `Flux.unstack(colorview(RGB, PermutedDimsArray(X, (3, 1, 2, 4))), 3)` (eager) |

Additional notes on this table:

* The transformations listed here are by no means the only possible versions. The Julia
  ecosystem has very generic array support, so you're invited to try other packages and
  see what fits your usage better.
* The height (H) and width (W) dimensions are often used interchangeably, so we don't differentiate them
  here. If the distinction matters to you, adjust the `permutedims`/`PermutedDimsArray`
  arguments accordingly.
* If you're processing GPU arrays, where scalar `getindex` is extremely slow, you should
  probably use the eager vector versions or insert some `collect` calls. A simple strategy is to
  do all the permutations on the CPU before you upload the data to the GPU (see the sketch after these notes).
* For performance reasons, the code here doesn't necessarily return the exact type declared in
  the table. For instance, `channelview` outputs a `ReinterpretArray` rather than an `Array`. In most cases
  this doesn't matter, but when it does, calling `collect` or a similar function will
  give you a dense representation.
* The methods in this table might not be AD-ready.

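A minimal sketch of that GPU strategy, assuming hypothetical sizes and Flux's `gpu` helper (which is a no-op when no GPU backend is available):

```julia
using Flux  # for `gpu`

X = rand(Float32, 224, 224, 3, 16)                    # hypothetical HWCN data on the CPU
X_whcn = collect(PermutedDimsArray(X, (2, 1, 3, 4)))  # permute lazily, then materialize on the CPU
X_gpu = gpu(X_whcn)                                   # upload the already-permuted dense array
```
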
## Why isn't there a single function that just works for every layout transformation?

It is theoretically possible, but the entire ecosystem (Flux, JuliaImages, CUDA.jl and others) is not
yet ready. The main reason is that in Julia we want to write generic code that works for all
different use cases. Hardcoding the meaning of a plain numerical array `Array{<:Number, 4}` in any
particular format, say WHCN in Flux, would introduce unavoidable ambiguities. To make this actually work in practice
we need more type and dimension information to dispatch on (e.g., `RGB`), but this requires a
lot of fundamental supporting work from the different Julia communities.

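A small illustration of the point about extra type information: a plain numeric array carries its layout only by convention, whereas a colorant eltype makes the channel dimension explicit and dispatchable (hypothetical sizes).

```julia
using ImageCore

x_plain = rand(Float32, 3, 32, 32, 64)   # CHWN? Only the code that produced it knows.
x_rgb = colorview(RGB, x_plain)          # a 32×32×64 array of RGB{Float32}
eltype(x_rgb) <: Colorant                # true: generic code can dispatch on this
```
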
So why does Flux use the WHCN format at present? This is mainly because Flux relies on C/C++
libraries (e.g., NVIDIA cuDNN) to do the computation for GPU arrays, and these hardcode the NCHW format (C/C++ is row-major); we
will explain this further below. Once Flux is smart enough to know how to unroll the data into a raw numerical array and feed
it to the different backends, things could just work for any generic array layout.

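As an aside, the two conventions describe the same flat memory: a row-major NCHW tensor is just a dimension-reversed view of a column-major WHCN array. A minimal sketch with hypothetical sizes:

```julia
whcn = rand(Float32, 32, 32, 3, 64)              # W, H, C, N in column-major Julia
nchw = PermutedDimsArray(whcn, (4, 3, 2, 1))     # N, C, H, W: a view, no copy
size(nchw)                                       # (64, 3, 32, 32)
whcn[1, 2, 3, 4] == nchw[4, 3, 2, 1]             # true: same element, same buffer
```
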
## What data layout should I use?

Let's start with a statement: there is no single best data layout that suits every use case; whenever
you choose a data layout, you are making a tradeoff between usability and performance. For
better usability we write code at the individual-element level, and for better performance we
exploit the data locality of the specific computation pattern.

### Represent an image

The first layout choice you might notice is Array of Structs (AOS) versus Struct of Arrays (SOA).
We need to differentiate these because they are stored quite differently in memory: the flat
memory representation of AOS is `RGBRGB...RGB`, while that of SOA is
`RR...RGG...GBB...B`.

When you store an image as `Array{<:RGB}`, you are storing it in the AOS layout:

```julia
using ImageCore
using TestImages
using BenchmarkTools

rgb = [RGB(0.1, 0.2, 0.3), RGB(0.4, 0.5, 0.6), RGB(0.7, 0.8, 0.9)] # Vector{RGB{Float64}}

# channelview opens the colorant struct without manipulating its memory
channelview(rgb)[:] == [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] # true
```

When you store an RGB image as `Array{Float32, 3}` (HWC), you are in fact storing it in the SOA layout:

```julia
# the same three pixels as above, one row per pixel and one column per channel
rgb_soa = [0.1 0.2 0.3; 0.4 0.5 0.6; 0.7 0.8 0.9]
red_channel = rgb_soa[:, 1] # [0.1, 0.4, 0.7], a contiguous slice in column-major memory
```

Because of this memory layout difference between AOS and SOA, the first and most important
consequence is that the AOS layout works best when the underlying function is defined per pixel, because it
benefits from data locality (e.g., the CPU cache):

```julia
# jitter defined as one broadcast over pixels (AOS-friendly)
function colorjitter(img)
    bias = 0.1rand(eltype(img))
    return @. clamp01(img + bias)
end

# jitter defined channel by channel (SOA-friendly, channel-last layout)
function colorjitter_cv(img)
    out = similar(img)
    @inbounds for i in axes(img, 3)
        bias = 0.1rand(eltype(img))
        channel = @view img[:, :, i]
        @. out[:, :, i] = clamp01(channel + bias)
    end
    return out
end

# array of structs layout
img = testimage("lighthouse"); # size (512, 768)
@btime colorjitter($img); # 1.035 ms (2 allocations: 9.00 MiB)

# struct of arrays layout
img = permutedims(channelview(testimage("lighthouse")), (2, 3, 1)); # size (512, 768, 3)
@btime colorjitter_cv($img); # 2.215 ms (2 allocations: 1.13 MiB)
```

Most kernel operations in JuliaImages are defined per pixel, so the AOS layout gives better
performance there. For the same reason, the SOA layout works best when the underlying function is defined per
channel. Deep learning relies on convolutional neural networks (CNNs) for image processing and
computer vision tasks, and a CNN is essentially a set of filter operations defined per channel, so CNN
backends (e.g., cuDNN) use the SOA layout to squeeze every last drop of performance out of the GPU (a per-channel sketch follows below).

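For example, a per-channel normalization (hypothetical ImageNet-style constants) is a natural fit for the SOA layout, because each channel is one contiguous slice:

```julia
x = rand(Float32, 512, 768, 3)                      # a hypothetical HWC (SOA) image
μ = reshape(Float32[0.485, 0.456, 0.406], 1, 1, 3)  # per-channel means
σ = reshape(Float32[0.229, 0.224, 0.225], 1, 1, 3)  # per-channel standard deviations
x_norm = @. (x - μ) / σ                             # broadcasts over whole contiguous channel slices
```
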
!!! note
    Modern hardware and compilers are sophisticated enough that this
    SOA-AOS performance gap may not be as big as you would expect, given enough engineering effort.

Performance aside, there are two other good reasons why JuliaImages uses the
array-of-structs layout:

* Julia's broadcasting is powerful, and you benefit from its simplicity most when you commit to
  the "every element of the array is a computational scalar" assumption; defining colorant structs
  such as `RGB` and storing an RGB image as `Array{RGB}` is a natural step towards this (see the sketch after this list).
* JuliaImages is designed to process generic images (not just 2D RGB images), and using a plain
  `Array{Float32, 3}` to store the data would cause ambiguities.

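A minimal sketch of that first point, treating each `RGB` value as a computational scalar (hypothetical image and adjustment factor):

```julia
using ImageCore

img = rand(RGB{Float32}, 4, 4)     # a hypothetical 4×4 RGB image
brighter = @. clamp01(1.2img)      # one broadcast over pixels, no explicit channel loop
```
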
### Represent a batch

`Vector{Array{Float32, 3}}` (WHC) and `Array{Float32, 4}` (WHCN) are two typical ways to store RGB
images in the SOA layout. What are the benefits of one over the other, given that they have the same memory
layout?

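As a quick check of the "same memory layout" claim (hypothetical sizes; using the positional `Flux.stack(xs, dim)` form assumed elsewhere on this page):

```julia
import Flux

imgs = [rand(Float32, 32, 32, 3) for _ in 1:4]   # Vector{Array{Float32, 3}}
batch = Flux.stack(imgs, 4)                      # Array{Float32, 4}, size (32, 32, 3, 4)
vec(batch) == reduce(vcat, vec.(imgs))           # true: same bytes, same order
```
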
In Julia there isn't much difference, because we have much better higher-order function support.
Assuming we have defined a _performant_ function `fn` that processes one single image `Array{Float32, 3}`, it is
clear that storing the data as a `Vector{Array}` makes things easier to write without losing performance:

```julia
# If it is Vector{Array{Float32, 3}}
map(fn, batch) # much simpler!

# If it is Array{Float32, 4}
out = similar(batch)
for i in axes(batch, 4)
    out[:, :, :, i] .= fn(@view(batch[:, :, :, i]))
end
```

But in Python things are quite different, because both the for-loop and `map` introduce
significant overhead due to the dynamic nature of Python. The solution there is to vectorize[1] the
computation by providing a bigger function `fn_vec` that handles a batch of images instead of a single image,
so that `fn_vec(batch)` just works. Any network layer in Flux can be seen as such an `fn_vec`. The reason
`Array{Float32, 4}` is used by the C/C++ backends is that it has the simplest memory structure: one
big memory block of plain old data.

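A minimal sketch of a Flux layer playing the role of `fn_vec` on a hypothetical WHCN batch:

```julia
using Flux

batch = rand(Float32, 32, 32, 3, 64)   # W, H, C, N
layer = Conv((3, 3), 3 => 16, relu)    # a convolution layer expecting WHCN input
size(layer(batch))                     # (30, 30, 16, 64): the whole batch processed at once
```
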
The conclusion: if we could build everything in pure Julia, then the `Vector{Array{Float32, 3}}`
layout would be the simplest solution. But the reality is that we don't have a Julia version of cuDNN with
competitive performance, so for quite a while yet we will have to live with the
ambiguous `Array{Float32, 4}` layout.

!!! warning "Disclaimer"
    There are many tricks in CUDA kernel programming to reduce data-transfer
    wait times and to increase concurrency. These techniques can affect your choice of data format
    (e.g., whether to use `Vector{Array{Float32, 3}}` or `Array{Float32, 4}`).
    This tutorial does not cover these tricks, so the recommendations above may change when
    working with a GPU.

## References

* [1] For broadcasting and vectorization, please read [More Dots: Syntactic Loop Fusion in Julia](https://julialang.org/blog/2017/01/moredots/).
Ideally, we would do a test of this and mention which ones are not AD-ready in the table.
I've thought about this too, and I've also considered having the code here tested against Documenter's doctest, but it seems the Flux documentation doesn't work that way. (See Flux.jl/docs/Project.toml, lines 1 to 4 in 7ada01c.)
I was thinking this would be a one-time manual step where one of us just runs through all the code in the table and makes note of the ones that error. Then we add a parenthetical for the ones that break AD.
As far as doctests for this page, there is no issue adding Images.jl to the docs Project.toml. None of the code here tests any of Flux's internals (any more than what's already tested), but it could be valuable to make sure we update this page in case anything goes out of date.