Fix several typos #482

Merged
merged 1 commit into from Apr 2, 2023
2 changes: 1 addition & 1 deletion docs/src/devdocs/loopset_structure.md
@@ -56,7 +56,7 @@ References to arrays are represented with an `ArrayReferenceMeta` data structure
julia> LoopVectorization.operations(lsAmulB)[3].ref
LoopVectorization.ArrayReferenceMeta(LoopVectorization.ArrayReference(:A, [:m, :k], Int8[0, 0]), Bool[1, 1], Symbol("##vptr##_A"))
```
-It contains the name of the parent array (`:A`), the indicies `[:m,:k]`, and a boolean vector (`Bool[1, 1]`) indicating whether these indices are loop iterables. Note that the optimizer assumes arrays are column-major, and thus that it is efficient to read contiguous elements from the first index. In lower level terms, it means that [high-throughput vmov](https://www.felixcloutier.com/x86/movupd) instructions can be used rather than [low-throughput](https://www.felixcloutier.com/x86/vgatherdpd:vgatherqpd) [gathers](https://www.felixcloutier.com/x86/vgatherqps:vgatherqpd). Similar story for storing elements.
+It contains the name of the parent array (`:A`), the indices `[:m,:k]`, and a boolean vector (`Bool[1, 1]`) indicating whether these indices are loop iterables. Note that the optimizer assumes arrays are column-major, and thus that it is efficient to read contiguous elements from the first index. In lower level terms, it means that [high-throughput vmov](https://www.felixcloutier.com/x86/movupd) instructions can be used rather than [low-throughput](https://www.felixcloutier.com/x86/vgatherdpd:vgatherqpd) [gathers](https://www.felixcloutier.com/x86/vgatherqps:vgatherqpd). Similar story for storing elements.
When no axis has unit stride, the first given index will be the dummy `Symbol("##DISCONTIGUOUSSUBARRAY##")`.

!!! warning
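The column-major assumption discussed in this hunk can be illustrated with plain Julia (this is an illustrative sketch of memory layout, not LoopVectorization internals):

```julia
# In a column-major array, A[m, k] and A[m+1, k] are adjacent in memory,
# so iterating over the *first* index walks contiguous addresses --
# exactly the access pattern that permits contiguous vector loads.
A = reshape(collect(1:12), 3, 4)        # 3×4 matrix; storage order is 1,2,3,4,...

col_walk = [A[m, 2] for m in 1:3]       # walk down a column (first index varies)
@assert col_walk == vec(A)[4:6]         # a contiguous slice of the underlying storage

row_walk = [A[2, k] for k in 1:4]       # walk along a row (second index varies)
@assert row_walk == vec(A)[2:3:11]      # strided: would need a gather if vectorized
```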
2 changes: 1 addition & 1 deletion docs/src/examples/array_interface.md
@@ -9,7 +9,7 @@ LoopVectorization uses [ArrayInterface.jl](https://github.com/SciML/ArrayInterfa
that wasn't optimized by `LoopVectorization`, but instead simply had `@inbounds @fastmath` applied to the loop. This can often still yield reasonable to good performance, saving you from having to write more than one version of the loop
to get good performance and correct behavior just because the array types happen to be different.

-By supporting the interface, using `LoopVectorization` can simplify implementing many operations like matrix multiply while still getting good performance. For example, instead of [a few hundred lines of code](https://github.com/JuliaArrays/StaticArrays.jl/blob/0e431022954f0207eeb2c4f661b9f76936105c8a/src/matrix_multiply.jl#L4) to define matix multiplication in `StaticArrays`, one could simply write:
+By supporting the interface, using `LoopVectorization` can simplify implementing many operations like matrix multiply while still getting good performance. For example, instead of [a few hundred lines of code](https://github.com/JuliaArrays/StaticArrays.jl/blob/0e431022954f0207eeb2c4f661b9f76936105c8a/src/matrix_multiply.jl#L4) to define matrix multiplication in `StaticArrays`, one could simply write:
```julia
using StaticArrays, LoopVectorization

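The docs example referenced above is truncated by the diff view. A minimal sketch of the kind of `@turbo` matmul loop those docs describe (hypothetical function name `mymul!`; not the exact snippet from the docs):

```julia
using LoopVectorization

# A plain triple loop; `@turbo` vectorizes and unrolls it,
# replacing hand-written SIMD kernels for any array type that
# supports the ArrayInterface.jl stridedpointer interface.
function mymul!(C, A, B)
    @turbo for n in axes(C, 2), m in axes(C, 1)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end

A = rand(4, 5); B = rand(5, 6); C = zeros(4, 6)
mymul!(C, A, B)
```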
4 changes: 2 additions & 2 deletions docs/src/examples/dot_product.md
@@ -25,7 +25,7 @@ Thus, in 4 clock cycles, we can do up to 8 loads. But each `fma` requires 2 load

Double precision benchmarks pitting Julia's builtin dot product, and code compiled with a variety of compilers:
![dot](https://raw.githubusercontent.com/JuliaSIMD/LoopVectorization.jl/docsassets/docs/src/assets/bench_dot_v2.svg)
-What we just described is the core of the approach used by all these compilers. The variation in results is explained mostly by how they handle vectors with lengths that are not an integer multiple of `W`. I ran these on a computer with AVX512 so that `W = 8`. LLVM, the backend compiler of both Julia and Clang, shows rapid performance degredation as `N % 4W` increases, where `N` is the length of the vectors.
+What we just described is the core of the approach used by all these compilers. The variation in results is explained mostly by how they handle vectors with lengths that are not an integer multiple of `W`. I ran these on a computer with AVX512 so that `W = 8`. LLVM, the backend compiler of both Julia and Clang, shows rapid performance degradation as `N % 4W` increases, where `N` is the length of the vectors.
This is because, to handle the remainder, it uses a scalar loop that runs as written: multiply and add single elements, one after the other.

Initially, GCC (gfortran) stumbled in throughput, because it does not use separate accumulation vectors by default except on Power, even with `-funroll-loops`.
@@ -36,7 +36,7 @@ LoopVectorization uses `if/ifelse` checks to determine how many extra vectors ar

Neither GCC nor LLVM use masks (without LoopVectorization's assitance).

-I am not certain, but I believe Intel and GCC check for the vector's alignment, and align them if neccessary. Julia guarantees that the start of arrays beyond a certain size are aligned, so this is not an optimization I have implemented. But it may be worthwhile for handling large matrices with a number of rows that isn't an integer multiple of `W`. For such matrices, the first column may be aligned, but the next will not be.
+I am not certain, but I believe Intel and GCC check for the vector's alignment, and align them if necessary. Julia guarantees that the start of arrays beyond a certain size are aligned, so this is not an optimization I have implemented. But it may be worthwhile for handling large matrices with a number of rows that isn't an integer multiple of `W`. For such matrices, the first column may be aligned, but the next will not be.

## Dot-Self

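The "separate accumulation vectors" and "scalar remainder loop" ideas from this section can be sketched in scalar Julia (an illustration of the dependency-breaking pattern, not the SIMD code any of these compilers emit):

```julia
# Four independent partial sums break the add-latency dependency chain,
# mirroring what a compiler does with four SIMD accumulator registers.
function dot4(a, b)
    s1 = s2 = s3 = s4 = zero(eltype(a))
    N = length(a)
    i = 1
    while i + 3 <= N            # main loop: 4 independent accumulators
        s1 += a[i]   * b[i]
        s2 += a[i+1] * b[i+1]
        s3 += a[i+2] * b[i+2]
        s4 += a[i+3] * b[i+3]
        i += 4
    end
    s = s1 + s2 + s3 + s4
    while i <= N                # scalar remainder loop, as LLVM emits
        s += a[i] * b[i]
        i += 1
    end
    return s
end
```

The remainder loop is why performance degrades as `N % 4W` grows: those trailing elements are processed one at a time.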
2 changes: 1 addition & 1 deletion docs/src/examples/matrix_multiplication.md
@@ -22,7 +22,7 @@ Letting all three matrices be square and `Size` x `Size`, we attain the followin
This is classic GEMM, `𝐂 = 𝐀 * 𝐁`. GFortran's intrinsic `matmul` function does fairly well. But all the compilers are well behind LoopVectorization here, which falls behind MKL's `gemm` beyond 70x70 or so. The problem imposed by alignment is also striking: performance is much higher when the sizes are integer multiplies of 8. Padding arrays so that each column is aligned regardless of the number of rows can thus be very profitable. [PaddedMatrices.jl](https://github.com/JuliaSIMD/PaddedMatrices.jl) offers just such arrays in Julia. I believe that is also what the [-pad](https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-pad-qpad) compiler flag does when using Intel's compilers.

![AmulBt](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/bench_AmulBt_v2.svg)
-The optimal pattern for `𝐂 = 𝐀 * 𝐁ᵀ` is almost identical to that for `𝐂 = 𝐀 * 𝐁`. Yet, gfortran's `matmul` instrinsic stumbles, surprisingly doing much worse than gfortran + loops, and almost certainly worse than allocating memory for `𝐁ᵀ` and creating the ecplicit copy.
+The optimal pattern for `𝐂 = 𝐀 * 𝐁ᵀ` is almost identical to that for `𝐂 = 𝐀 * 𝐁`. Yet, gfortran's `matmul` intrinsic stumbles, surprisingly doing much worse than gfortran + loops, and almost certainly worse than allocating memory for `𝐁ᵀ` and creating the explicit copy.

ifort did equally well whethor or not `𝐁` was transposed, while LoopVectorization's performance degraded slightly faster as a function of size in the transposed case, because strides between memory accesses are larger when `𝐁` is transposed. But it still performed best of all the compiled loops over this size range, losing out to MKL and eventually OpenBLAS.
icc interestingly does better when it is transposed.
4 changes: 2 additions & 2 deletions docs/src/examples/multithreading.md
@@ -188,7 +188,7 @@ end
```
![complexdot3](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/threadedcomplexdot3product.svg)

-When testing on my laptop, the `C` implentation ultimately won, but I will need to investigate further to tell whether this benchmark benefits from hyperthreading,
+When testing on my laptop, the `C` implementation ultimately won, but I will need to investigate further to tell whether this benchmark benefits from hyperthreading,
or if it's because LoopVectorization's memory access patterns are less friendly.
I plan to work on cache-level blocking to increase memory friendliness eventually, and will likely also allow it to take advantage of hyperthreading/simultaneous multithreading, although I'd prefer a few motivating test problems to look at first. Note that a single core of this CPU is capable of exceeding 100 GFLOPS of double precision compute. The execution units are spending most of their time idle. So the question of whether hypthreading helps may be one of whether or not we are memory-limited.

@@ -218,7 +218,7 @@ julia> doubles_per_l2 = (2 ^ 20) ÷ 8
julia> total_doubles_in_l2 = doubles_per_l2 * (Sys.CPU_THREADS ÷ 2) # doubles_per_l2 * 18
2359296

-julia> doubles_per_mat = total_doubles_in_l2 ÷ 3 # divide up amoung 3 matrices
+julia> doubles_per_mat = total_doubles_in_l2 ÷ 3 # divide up among 3 matrices
786432

julia> sqrt(ans)
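The REPL session in this hunk is a back-of-the-envelope cache-blocking calculation; it can be reproduced as a plain script (assuming, as the session's comment implies, a 1 MiB L2 per core and 18 physical cores, i.e. `Sys.CPU_THREADS ÷ 2` on the benchmark machine):

```julia
# How large can square blocks of A, B, and C be while all three
# fit in the combined L2 caches?
l2_bytes        = 2^20                            # assumed: 1 MiB of L2 per core
doubles_per_l2  = l2_bytes ÷ 8                    # 131072 Float64s per L2
physical_cores  = 18                              # assumed: Sys.CPU_THREADS ÷ 2
total_doubles   = doubles_per_l2 * physical_cores # doubles across all L2s
doubles_per_mat = total_doubles ÷ 3               # split among the 3 matrices
side = sqrt(doubles_per_mat)                      # ≈ 886.8 → square blocks ~886×886
```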
4 changes: 2 additions & 2 deletions docs/src/examples/special_functions.md
@@ -16,12 +16,12 @@ end
While Intel's proprietary compilers do the best, LoopVectorization performs very well among open source alternatives. A complicating
factor to the above benchmark is that in accessing the diagonals, we are not accessing contiguous elements. A benchmark
simply exponentiating a vector shows that `gcc` also has efficient special function vectorization, but that the autovectorizer
-disagrees with the discontiguous memory acesses:
+disagrees with the discontiguous memory accesses:

![selfdot](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/bench_exp_v2.svg)

The similar performance between `gfortran` and `LoopVectorization` at multiples of 8 is no fluke: on Linux systems with a recent GLIBC, SLEEFPirates.jl --
which LoopVectorization depends on to vectorize these special functions -- looks for the GNU vector library and uses these functions
if available. Otherwise, it will use native Julia implementations that tend to be slower. As the modulus of vector length and vector width (8, on the
-host system thanks to AVX512) increases, `gfortran` shows the performance degredation pattern typical of LLVM-vectorized code.
+host system thanks to AVX512) increases, `gfortran` shows the performance degradation pattern typical of LLVM-vectorized code.

2 changes: 1 addition & 1 deletion src/codegen/lower_compute.jl
@@ -529,7 +529,7 @@ function lower_compute!(
parents_op = parents(op)
nparents = length(parents_op)
# __u₂max = ls.unrollspecification.u₂
-# TODO: perhaps allos for swithcing unrolled axis again
+# TODO: perhaps allow for switching unrolled axis again
mvar, u₁unrolledsym, u₂unrolledsym =
variable_name_and_unrolled(op, u₁loopsym, u₂loopsym, vloopsym, suffix, ls)
opunrolled = u₁unrolledsym || isu₁unrolled(op)
6 changes: 3 additions & 3 deletions src/codegen/lower_load.jl
@@ -157,15 +157,15 @@ function pushbroadcast!(q::Expr, mvar::Symbol)
)
end

-function child_cost_untill_vectorized(op::Operation)
+function child_cost_until_vectorized(op::Operation)
isvectorized(op) && return 0.0
c = 0.0
for child ∈ children(op)
if (!isvectorized(child) & iscompute(child))
# FIXME: can double count
c +=
instruction_cost(instruction(child)).scalar_reciprocal_throughput +
-child_cost_untill_vectorized(child)
+child_cost_until_vectorized(child)
end
end
c
@@ -174,7 +174,7 @@ function vectorization_profitable(op::Operation)
# if op is vectorized itself, return true
isvectorized(op) && return true
# otherwise, check if descendents until hitting a vectorized portion are expensive enough
-child_cost_untill_vectorized(op) ≥ 5
+child_cost_until_vectorized(op) ≥ 5
end

function lower_load_no_optranslation!(
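The renamed helper walks an operation's children, summing scalar costs until it reaches vectorized nodes. The same pattern in miniature (a hypothetical `Node` type standing in for LoopVectorization's `Operation`; not the package's actual code):

```julia
# Recurse through children, accumulating per-node cost, and stop
# descending at "vectorized" nodes, which contribute zero.
struct Node
    vectorized::Bool
    cost::Float64
    children::Vector{Node}
end

function cost_until_vectorized(n::Node)
    n.vectorized && return 0.0
    c = 0.0
    for child in n.children
        if !child.vectorized
            c += child.cost + cost_until_vectorized(child)
        end
    end
    return c
end

leafv = Node(true, 1.0, Node[])    # vectorized leaf: contributes nothing
mid   = Node(false, 2.0, [leafv])  # scalar node: contributes its own cost
root  = Node(false, 0.0, [mid])
```

`vectorization_profitable` then compares this accumulated scalar cost against a threshold (`≥ 5`) to decide whether vectorizing the load pays off.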
4 changes: 2 additions & 2 deletions src/codegen/lower_store.jl
@@ -377,7 +377,7 @@ function lower_tiled_store!(
inds_calc_by_ptr_offset = indices_calculated_by_pointer_offsets(ls, op.ref)

if donot_tile_store(ls, op, reductfunc, u₂)
-# If we have a reductfunc, we're using a reducing store instead of a contiuguous or shuffle store anyway
+# If we have a reductfunc, we're using a reducing store instead of a contiguous or shuffle store anyway
# so no benefit to being able to handle that case here, vs just calling the default `lower_store!` method
@unpack u₁, u₂max = ua
for t ∈ 0:u₂-1
@@ -408,7 +408,7 @@
u = Core.ifelse(isu₁, u₁, 1)
tup = Expr(:tuple)
for t ∈ 0:u₂-1
-# tiled stores cannot be loop values, as they're necessarilly
+# tiled stores cannot be loop values, as they're necessarily
# functions of at least two loops, meaning we do not need to handle them here.
push!(tup.args, Symbol(variable_name(opp, ifelse(isu₂, t, -1)), '_', u))
end
6 changes: 3 additions & 3 deletions src/codegen/lower_threads.jl
@@ -977,10 +977,10 @@ function avx_threads_expr(
LPSYM::Expr
)
valid_thread_loop, ua, c = valid_thread_loops(ls)
-num_candiates = sum(valid_thread_loop)
-if (num_candiates == 0) || (nt ≤ 1) # it was called from `avx_body` but now `nt` was set to `1`
+num_candidates = sum(valid_thread_loop)
+if (num_candidates == 0) || (nt ≤ 1) # it was called from `avx_body` but now `nt` was set to `1`
avx_body(ls, UNROLL)
-elseif (num_candiates == 1) || (nt ≤ 3)
+elseif (num_candidates == 1) || (nt ≤ 3)
thread_one_loops_expr(
ls,
ua,
2 changes: 1 addition & 1 deletion src/codegen/lowering.jl
@@ -1234,7 +1234,7 @@ function calc_Ureduct!(ls::LoopSet, us::UnrollSpecification)
elseif !((u₁ui == Int(u₁u)) & (u₂ui == Int(u₁u)))
throw(
ArgumentError(
-"Doesn't currenly handle differently unrolled reductions yet, please file an issue with an example."
+"Doesn't currently handle differently unrolled reductions yet, please file an issue with an example."
)
)
end
2 changes: 1 addition & 1 deletion src/condense_loopset.jl
@@ -452,7 +452,7 @@ function should_zerorangestart(
allsame = true
# The idea here is that if any ref to the same array doesn't have `ind`,
# we can't offset that dimension because different inds will clash.
-# Because offseting the array means counter-offseting the range, we need
+# Because offsetting the array means counter-offsetting the range, we need
# to be consistent, and check that all arrays are valid first.
for j ∈ @view(namev[2:end])
ref = allarrayrefs[j]
2 changes: 1 addition & 1 deletion src/constructors.jl
@@ -407,7 +407,7 @@ end
@tturbo

Equivalent to `@turbo`, except it adds `thread=true` as the first keyword argument.
-Note that later arguments take precendence.
+Note that later arguments take precedence.

Meant for convenience, as `@tturbo` is shorter than `@turbo thread=true`.
"""
4 changes: 2 additions & 2 deletions src/modeling/determinestrategy.jl
@@ -201,7 +201,7 @@ function evaluate_cost_unroll(
# Need to check if fusion is possible
for itersym ∈ order
cacheunrolled!(ls, itersym, Symbol(""), vloopsym)
-# Add to set of defined symbles
+# Add to set of defined symbols
push!(nested_loop_syms, itersym)
looplength = length(ls, itersym)
liter = itersym === vloopsym ? num_iterations(looplength, W) : looplength
@@ -1257,7 +1257,7 @@ function evaluate_cost_tile!(
elseif itersym == u₂loopsym
u₂reached = true
end
-# Add to set of defined symbles
+# Add to set of defined symbols
push!(nested_loop_syms, itersym)
looplength = length(ls, itersym)
iter *= itersym === vloopsym ? num_iterations(looplength, W) : looplength
4 changes: 2 additions & 2 deletions src/modeling/graphs.jl
@@ -1805,7 +1805,7 @@ function push_op!(
add_andblock!(ls, ex, elementbytes, position)
elseif ex.head === :||
add_orblock!(ls, ex, elementbytes, position)
-elseif ex.head === :local # Handle locals introduced by `@inbounds`; using `local` with `@turbo` is not recomended (nor is `@inbounds`; which applies automatically regardless)
+elseif ex.head === :local # Handle locals introduced by `@inbounds`; using `local` with `@turbo` is not recommended (nor is `@inbounds`; which applies automatically regardless)
@assert length(ex.args) == 1 # TODO replace assert + first with "only" once support for Julia < 1.4 is dropped
localbody = first(ex.args)
@assert localbody.head === :(=)
@@ -2117,7 +2117,7 @@ end

"""
Returns `0` if the op is the declaration of the constant outerreduction variable.
-Returns `n`, where `n` is the constant declarations's index among parents(op), if op is an outter reduction.
+Returns `n`, where `n` is the constant declarations's index among parents(op), if op is an outer reduction.
Returns `-1` if not an outerreduction.
"""
function isouterreduction(ls::LoopSet, op::Operation)
4 changes: 2 additions & 2 deletions src/modeling/operations.jl
@@ -163,7 +163,7 @@ loopvalue
"""
Operation

-A structure to encode a particular action occuring inside an `@turbo` block.
+A structure to encode a particular action occurring inside an `@turbo` block.

# Fields

@@ -196,7 +196,7 @@ Each one of these lines is a pretty-printed `Operation`.
"""
mutable struct Operation <: AbstractLoopOperation
"""A unique identifier for this operation.
-`identifer(op::Operation)` returns the index of this operation within `operations(ls::LoopSet)`."""
+`identifier(op::Operation)` returns the index of this operation within `operations(ls::LoopSet)`."""
identifier::Int
"""The name of the variable storing the result of this operation.
For `a = val` this would be `:a`. For array assignments `A[i,j] = val` this would be `:A`."""
2 changes: 1 addition & 1 deletion src/parse/memory_ops_common.jl
@@ -585,7 +585,7 @@ function checkforoffset!(
if length(mult_syms) == 1
mlt, sym = only(mult_syms)
if !byterepresentable(mlt)
-# this is so we don't unnecessarilly add a separate offset
+# this is so we don't unnecessarily add a separate offset
muladd_index!(
ls,
opparents,
2 changes: 1 addition & 1 deletion test/gemm.jl
@@ -804,7 +804,7 @@
dense!(LoopVectorization.relu, C, A2, B)
@test C ≈
LoopVectorization.relu.(@view(A2[:, begin:end-1]) * B .+ @view(A2[:, end]))
-@testset "avx $T dynamc gemm" begin
+@testset "avx $T dynamic gemm" begin
AmulB!(C2, A, B)
AmulBavx1!(C, A, B)
@test C ≈ C2
2 changes: 1 addition & 1 deletion test/steprange.jl
@@ -1,7 +1,7 @@



-# Auxillary functions
+# Auxiliary functions
const _uint_bit_length = sizeof(UInt) * 8
const _div_uint_size_shift = Int(log2(_uint_bit_length))
@inline _mul2(i::Integer) = i << 1