From 1421459dd4b768453ccb49e9df6538bc9e2bd0ef Mon Sep 17 00:00:00 2001 From: Alexander Seiler Date: Sun, 2 Apr 2023 05:19:58 +0200 Subject: [PATCH] Fix several typos (#482) Signed-off-by: Alexander Seiler --- docs/src/devdocs/loopset_structure.md | 2 +- docs/src/examples/array_interface.md | 2 +- docs/src/examples/dot_product.md | 4 ++-- docs/src/examples/matrix_multiplication.md | 2 +- docs/src/examples/multithreading.md | 4 ++-- docs/src/examples/special_functions.md | 4 ++-- src/codegen/lower_compute.jl | 2 +- src/codegen/lower_load.jl | 6 +++--- src/codegen/lower_store.jl | 4 ++-- src/codegen/lower_threads.jl | 6 +++--- src/codegen/lowering.jl | 2 +- src/condense_loopset.jl | 2 +- src/constructors.jl | 2 +- src/modeling/determinestrategy.jl | 4 ++-- src/modeling/graphs.jl | 4 ++-- src/modeling/operations.jl | 4 ++-- src/parse/memory_ops_common.jl | 2 +- test/gemm.jl | 2 +- test/steprange.jl | 2 +- 19 files changed, 30 insertions(+), 30 deletions(-) diff --git a/docs/src/devdocs/loopset_structure.md b/docs/src/devdocs/loopset_structure.md index 923dc7aeb..c51d0d512 100644 --- a/docs/src/devdocs/loopset_structure.md +++ b/docs/src/devdocs/loopset_structure.md @@ -56,7 +56,7 @@ References to arrays are represented with an `ArrayReferenceMeta` data structure julia> LoopVectorization.operations(lsAmulB)[3].ref LoopVectorization.ArrayReferenceMeta(LoopVectorization.ArrayReference(:A, [:m, :k], Int8[0, 0]), Bool[1, 1], Symbol("##vptr##_A")) ``` -It contains the name of the parent array (`:A`), the indicies `[:m,:k]`, and a boolean vector (`Bool[1, 1]`) indicating whether these indices are loop iterables. Note that the optimizer assumes arrays are column-major, and thus that it is efficient to read contiguous elements from the first index. 
In lower level terms, it means that [high-throughput vmov](https://www.felixcloutier.com/x86/movupd) instructions can be used rather than [low-throughput](https://www.felixcloutier.com/x86/vgatherdpd:vgatherqpd) [gathers](https://www.felixcloutier.com/x86/vgatherqps:vgatherqpd). Similar story for storing elements. +It contains the name of the parent array (`:A`), the indices `[:m,:k]`, and a boolean vector (`Bool[1, 1]`) indicating whether these indices are loop iterables. Note that the optimizer assumes arrays are column-major, and thus that it is efficient to read contiguous elements from the first index. In lower level terms, it means that [high-throughput vmov](https://www.felixcloutier.com/x86/movupd) instructions can be used rather than [low-throughput](https://www.felixcloutier.com/x86/vgatherdpd:vgatherqpd) [gathers](https://www.felixcloutier.com/x86/vgatherqps:vgatherqpd). Similar story for storing elements. When no axis has unit stride, the first given index will be the dummy `Symbol("##DISCONTIGUOUSSUBARRAY##")`. !!! warning diff --git a/docs/src/examples/array_interface.md b/docs/src/examples/array_interface.md index 4c765c85d..603291280 100644 --- a/docs/src/examples/array_interface.md +++ b/docs/src/examples/array_interface.md @@ -9,7 +9,7 @@ LoopVectorization uses [ArrayInterface.jl](https://github.com/SciML/ArrayInterfa that wasn't optimized by `LoopVectorization`, but instead simply had `@inbounds @fastmath` applied to the loop. This can often still yield reasonable to good performance, saving you from having to write more than one version of the loop to get good performance and correct behavior just because the array types happen to be different. -By supporting the interface, using `LoopVectorization` can simplify implementing many operations like matrix multiply while still getting good performance. 
For example, instead of [a few hundred lines of code](https://github.com/JuliaArrays/StaticArrays.jl/blob/0e431022954f0207eeb2c4f661b9f76936105c8a/src/matrix_multiply.jl#L4) to define matix multiplication in `StaticArrays`, one could simply write: +By supporting the interface, using `LoopVectorization` can simplify implementing many operations like matrix multiply while still getting good performance. For example, instead of [a few hundred lines of code](https://github.com/JuliaArrays/StaticArrays.jl/blob/0e431022954f0207eeb2c4f661b9f76936105c8a/src/matrix_multiply.jl#L4) to define matrix multiplication in `StaticArrays`, one could simply write: ```julia using StaticArrays, LoopVectorization diff --git a/docs/src/examples/dot_product.md b/docs/src/examples/dot_product.md index 086543a73..7fc9d4b78 100644 --- a/docs/src/examples/dot_product.md +++ b/docs/src/examples/dot_product.md @@ -25,7 +25,7 @@ Thus, in 4 clock cycles, we can do up to 8 loads. But each `fma` requires 2 load Double precision benchmarks pitting Julia's builtin dot product, and code compiled with a variety of compilers: ![dot](https://raw.githubusercontent.com/JuliaSIMD/LoopVectorization.jl/docsassets/docs/src/assets/bench_dot_v2.svg) -What we just described is the core of the approach used by all these compilers. The variation in results is explained mostly by how they handle vectors with lengths that are not an integer multiple of `W`. I ran these on a computer with AVX512 so that `W = 8`. LLVM, the backend compiler of both Julia and Clang, shows rapid performance degredation as `N % 4W` increases, where `N` is the length of the vectors. +What we just described is the core of the approach used by all these compilers. The variation in results is explained mostly by how they handle vectors with lengths that are not an integer multiple of `W`. I ran these on a computer with AVX512 so that `W = 8`. 
LLVM, the backend compiler of both Julia and Clang, shows rapid performance degradation as `N % 4W` increases, where `N` is the length of the vectors. This is because, to handle the remainder, it uses a scalar loop that runs as written: multiply and add single elements, one after the other. Initially, GCC (gfortran) stumbled in throughput, because it does not use separate accumulation vectors by default except on Power, even with `-funroll-loops`. @@ -36,7 +36,7 @@ LoopVectorization uses `if/ifelse` checks to determine how many extra vectors ar Neither GCC nor LLVM use masks (without LoopVectorization's assitance). -I am not certain, but I believe Intel and GCC check for the vector's alignment, and align them if neccessary. Julia guarantees that the start of arrays beyond a certain size are aligned, so this is not an optimization I have implemented. But it may be worthwhile for handling large matrices with a number of rows that isn't an integer multiple of `W`. For such matrices, the first column may be aligned, but the next will not be. +I am not certain, but I believe Intel and GCC check for the vector's alignment, and align them if necessary. Julia guarantees that the start of arrays beyond a certain size are aligned, so this is not an optimization I have implemented. But it may be worthwhile for handling large matrices with a number of rows that isn't an integer multiple of `W`. For such matrices, the first column may be aligned, but the next will not be. ## Dot-Self diff --git a/docs/src/examples/matrix_multiplication.md b/docs/src/examples/matrix_multiplication.md index 1fcf3a8d7..78c96e7db 100644 --- a/docs/src/examples/matrix_multiplication.md +++ b/docs/src/examples/matrix_multiplication.md @@ -22,7 +22,7 @@ Letting all three matrices be square and `Size` x `Size`, we attain the followin This is classic GEMM, `𝐂 = 𝐀 * 𝐁`. GFortran's intrinsic `matmul` function does fairly well. 
But all the compilers are well behind LoopVectorization here, which falls behind MKL's `gemm` beyond 70x70 or so. The problem imposed by alignment is also striking: performance is much higher when the sizes are integer multiplies of 8. Padding arrays so that each column is aligned regardless of the number of rows can thus be very profitable. [PaddedMatrices.jl](https://github.com/JuliaSIMD/PaddedMatrices.jl) offers just such arrays in Julia. I believe that is also what the [-pad](https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-pad-qpad) compiler flag does when using Intel's compilers. ![AmulBt](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/bench_AmulBt_v2.svg) -The optimal pattern for `𝐂 = 𝐀 * 𝐁ᵀ` is almost identical to that for `𝐂 = 𝐀 * 𝐁`. Yet, gfortran's `matmul` instrinsic stumbles, surprisingly doing much worse than gfortran + loops, and almost certainly worse than allocating memory for `𝐁ᵀ` and creating the ecplicit copy. +The optimal pattern for `𝐂 = 𝐀 * 𝐁ᵀ` is almost identical to that for `𝐂 = 𝐀 * 𝐁`. Yet, gfortran's `matmul` intrinsic stumbles, surprisingly doing much worse than gfortran + loops, and almost certainly worse than allocating memory for `𝐁ᵀ` and creating the explicit copy. ifort did equally well whethor or not `𝐁` was transposed, while LoopVectorization's performance degraded slightly faster as a function of size in the transposed case, because strides between memory accesses are larger when `𝐁` is transposed. But it still performed best of all the compiled loops over this size range, losing out to MKL and eventually OpenBLAS. icc interestingly does better when it is transposed. 
diff --git a/docs/src/examples/multithreading.md b/docs/src/examples/multithreading.md index a1ad1ae90..02d0c17ab 100644 --- a/docs/src/examples/multithreading.md +++ b/docs/src/examples/multithreading.md @@ -188,7 +188,7 @@ end ``` ![complexdot3](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/threadedcomplexdot3product.svg) -When testing on my laptop, the `C` implentation ultimately won, but I will need to investigate further to tell whether this benchmark benefits from hyperthreading, +When testing on my laptop, the `C` implementation ultimately won, but I will need to investigate further to tell whether this benchmark benefits from hyperthreading, or if it's because LoopVectorization's memory access patterns are less friendly. I plan to work on cache-level blocking to increase memory friendliness eventually, and will likely also allow it to take advantage of hyperthreading/simultaneous multithreading, although I'd prefer a few motivating test problems to look at first. Note that a single core of this CPU is capable of exceeding 100 GFLOPS of double precision compute. The execution units are spending most of their time idle. So the question of whether hypthreading helps may be one of whether or not we are memory-limited. @@ -218,7 +218,7 @@ julia> doubles_per_l2 = (2 ^ 20) ÷ 8 julia> total_doubles_in_l2 = doubles_per_l2 * (Sys.CPU_THREADS ÷ 2) # doubles_per_l2 * 18 2359296 -julia> doubles_per_mat = total_doubles_in_l2 ÷ 3 # divide up amoung 3 matrices +julia> doubles_per_mat = total_doubles_in_l2 ÷ 3 # divide up among 3 matrices 786432 julia> sqrt(ans) diff --git a/docs/src/examples/special_functions.md b/docs/src/examples/special_functions.md index 75395e935..868731258 100644 --- a/docs/src/examples/special_functions.md +++ b/docs/src/examples/special_functions.md @@ -16,12 +16,12 @@ end While Intel's proprietary compilers do the best, LoopVectorization performs very well among open source alternatives. 
A complicating factor to the above benchmark is that in accessing the diagonals, we are not accessing contiguous elements. A benchmark simply exponentiating a vector shows that `gcc` also has efficient special function vectorization, but that the autovectorizer -disagrees with the discontiguous memory acesses: +disagrees with the discontiguous memory accesses: ![selfdot](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/bench_exp_v2.svg) The similar performance between `gfortran` and `LoopVectorization` at multiples of 8 is no fluke: on Linux systems with a recent GLIBC, SLEEFPirates.jl -- which LoopVectorization depends on to vectorize these special functions -- looks for the GNU vector library and uses these functions if available. Otherwise, it will use native Julia implementations that tend to be slower. As the modulus of vector length and vector width (8, on the -host system thanks to AVX512) increases, `gfortran` shows the performance degredation pattern typical of LLVM-vectorized code. +host system thanks to AVX512) increases, `gfortran` shows the performance degradation pattern typical of LLVM-vectorized code. 
diff --git a/src/codegen/lower_compute.jl b/src/codegen/lower_compute.jl index 86a3c062b..27e7c27a5 100644 --- a/src/codegen/lower_compute.jl +++ b/src/codegen/lower_compute.jl @@ -529,7 +529,7 @@ function lower_compute!( parents_op = parents(op) nparents = length(parents_op) # __u₂max = ls.unrollspecification.u₂ - # TODO: perhaps allos for swithcing unrolled axis again + # TODO: perhaps allow for switching unrolled axis again mvar, u₁unrolledsym, u₂unrolledsym = variable_name_and_unrolled(op, u₁loopsym, u₂loopsym, vloopsym, suffix, ls) opunrolled = u₁unrolledsym || isu₁unrolled(op) diff --git a/src/codegen/lower_load.jl b/src/codegen/lower_load.jl index d9af6a55c..8c5a7f6b7 100644 --- a/src/codegen/lower_load.jl +++ b/src/codegen/lower_load.jl @@ -157,7 +157,7 @@ function pushbroadcast!(q::Expr, mvar::Symbol) ) end -function child_cost_untill_vectorized(op::Operation) +function child_cost_until_vectorized(op::Operation) isvectorized(op) && return 0.0 c = 0.0 for child ∈ children(op) @@ -165,7 +165,7 @@ function child_cost_untill_vectorized(op::Operation) # FIXME: can double count c += instruction_cost(instruction(child)).scalar_reciprocal_throughput + - child_cost_untill_vectorized(child) + child_cost_until_vectorized(child) end end c @@ -174,7 +174,7 @@ function vectorization_profitable(op::Operation) # if op is vectorized itself, return true isvectorized(op) && return true # otherwise, check if descendents until hitting a vectorized portion are expensive enough - child_cost_untill_vectorized(op) ≥ 5 + child_cost_until_vectorized(op) ≥ 5 end function lower_load_no_optranslation!( diff --git a/src/codegen/lower_store.jl b/src/codegen/lower_store.jl index 58fad9fef..ad7a13afb 100644 --- a/src/codegen/lower_store.jl +++ b/src/codegen/lower_store.jl @@ -377,7 +377,7 @@ function lower_tiled_store!( inds_calc_by_ptr_offset = indices_calculated_by_pointer_offsets(ls, op.ref) if donot_tile_store(ls, op, reductfunc, u₂) - # If we have a reductfunc, we're using a reducing 
store instead of a contiuguous or shuffle store anyway + # If we have a reductfunc, we're using a reducing store instead of a contiguous or shuffle store anyway # so no benefit to being able to handle that case here, vs just calling the default `lower_store!` method @unpack u₁, u₂max = ua for t ∈ 0:u₂-1 @@ -408,7 +408,7 @@ function lower_tiled_store!( u = Core.ifelse(isu₁, u₁, 1) tup = Expr(:tuple) for t ∈ 0:u₂-1 - # tiled stores cannot be loop values, as they're necessarilly + # tiled stores cannot be loop values, as they're necessarily # functions of at least two loops, meaning we do not need to handle them here. push!(tup.args, Symbol(variable_name(opp, ifelse(isu₂, t, -1)), '_', u)) end diff --git a/src/codegen/lower_threads.jl b/src/codegen/lower_threads.jl index e1c0d770e..c3ec229df 100644 --- a/src/codegen/lower_threads.jl +++ b/src/codegen/lower_threads.jl @@ -977,10 +977,10 @@ function avx_threads_expr( LPSYM::Expr ) valid_thread_loop, ua, c = valid_thread_loops(ls) - num_candiates = sum(valid_thread_loop) - if (num_candiates == 0) || (nt ≤ 1) # it was called from `avx_body` but now `nt` was set to `1` + num_candidates = sum(valid_thread_loop) + if (num_candidates == 0) || (nt ≤ 1) # it was called from `avx_body` but now `nt` was set to `1` avx_body(ls, UNROLL) - elseif (num_candiates == 1) || (nt ≤ 3) + elseif (num_candidates == 1) || (nt ≤ 3) thread_one_loops_expr( ls, ua, diff --git a/src/codegen/lowering.jl b/src/codegen/lowering.jl index a058bc0b3..8760e2ecc 100644 --- a/src/codegen/lowering.jl +++ b/src/codegen/lowering.jl @@ -1234,7 +1234,7 @@ function calc_Ureduct!(ls::LoopSet, us::UnrollSpecification) elseif !((u₁ui == Int(u₁u)) & (u₂ui == Int(u₁u))) throw( ArgumentError( - "Doesn't currenly handle differently unrolled reductions yet, please file an issue with an example." + "Doesn't currently handle differently unrolled reductions yet, please file an issue with an example." 
) ) end diff --git a/src/condense_loopset.jl b/src/condense_loopset.jl index 716c768fa..403f865f5 100644 --- a/src/condense_loopset.jl +++ b/src/condense_loopset.jl @@ -452,7 +452,7 @@ function should_zerorangestart( allsame = true # The idea here is that if any ref to the same array doesn't have `ind`, # we can't offset that dimension because different inds will clash. - # Because offseting the array means counter-offseting the range, we need + # Because offsetting the array means counter-offsetting the range, we need # to be consistent, and check that all arrays are valid first. for j ∈ @view(namev[2:end]) ref = allarrayrefs[j] diff --git a/src/constructors.jl b/src/constructors.jl index 1aeb78bb0..39fc3bd8c 100644 --- a/src/constructors.jl +++ b/src/constructors.jl @@ -407,7 +407,7 @@ end @tturbo Equivalent to `@turbo`, except it adds `thread=true` as the first keyword argument. -Note that later arguments take precendence. +Note that later arguments take precedence. Meant for convenience, as `@tturbo` is shorter than `@turbo thread=true`. """ diff --git a/src/modeling/determinestrategy.jl b/src/modeling/determinestrategy.jl index b07590c3e..37a66e17f 100644 --- a/src/modeling/determinestrategy.jl +++ b/src/modeling/determinestrategy.jl @@ -201,7 +201,7 @@ function evaluate_cost_unroll( # Need to check if fusion is possible for itersym ∈ order cacheunrolled!(ls, itersym, Symbol(""), vloopsym) - # Add to set of defined symbles + # Add to set of defined symbols push!(nested_loop_syms, itersym) looplength = length(ls, itersym) liter = itersym === vloopsym ? num_iterations(looplength, W) : looplength @@ -1257,7 +1257,7 @@ function evaluate_cost_tile!( elseif itersym == u₂loopsym u₂reached = true end - # Add to set of defined symbles + # Add to set of defined symbols push!(nested_loop_syms, itersym) looplength = length(ls, itersym) iter *= itersym === vloopsym ? 
num_iterations(looplength, W) : looplength diff --git a/src/modeling/graphs.jl b/src/modeling/graphs.jl index a9ad81b46..3fe93f17e 100644 --- a/src/modeling/graphs.jl +++ b/src/modeling/graphs.jl @@ -1805,7 +1805,7 @@ function push_op!( add_andblock!(ls, ex, elementbytes, position) elseif ex.head === :|| add_orblock!(ls, ex, elementbytes, position) - elseif ex.head === :local # Handle locals introduced by `@inbounds`; using `local` with `@turbo` is not recomended (nor is `@inbounds`; which applies automatically regardless) + elseif ex.head === :local # Handle locals introduced by `@inbounds`; using `local` with `@turbo` is not recommended (nor is `@inbounds`; which applies automatically regardless) @assert length(ex.args) == 1 # TODO replace assert + first with "only" once support for Julia < 1.4 is dropped localbody = first(ex.args) @assert localbody.head === :(=) @@ -2117,7 +2117,7 @@ end """ Returns `0` if the op is the declaration of the constant outerreduction variable. -Returns `n`, where `n` is the constant declarations's index among parents(op), if op is an outter reduction. +Returns `n`, where `n` is the constant declaration's index among parents(op), if op is an outer reduction. Returns `-1` if not an outerreduction. """ function isouterreduction(ls::LoopSet, op::Operation) diff --git a/src/modeling/operations.jl b/src/modeling/operations.jl index 51b5b9509..3ff9b2442 100644 --- a/src/modeling/operations.jl +++ b/src/modeling/operations.jl @@ -163,7 +163,7 @@ loopvalue """ Operation -A structure to encode a particular action occuring inside an `@turbo` block. +A structure to encode a particular action occurring inside an `@turbo` block. # Fields @@ -196,7 +196,7 @@ Each one of these lines is a pretty-printed `Operation`. """ mutable struct Operation <: AbstractLoopOperation """A unique identifier for this operation. 
- `identifer(op::Operation)` returns the index of this operation within `operations(ls::LoopSet)`.""" + `identifier(op::Operation)` returns the index of this operation within `operations(ls::LoopSet)`.""" identifier::Int """The name of the variable storing the result of this operation. For `a = val` this would be `:a`. For array assignments `A[i,j] = val` this would be `:A`.""" diff --git a/src/parse/memory_ops_common.jl b/src/parse/memory_ops_common.jl index 21a30b108..2d6cf9e15 100644 --- a/src/parse/memory_ops_common.jl +++ b/src/parse/memory_ops_common.jl @@ -585,7 +585,7 @@ function checkforoffset!( if length(mult_syms) == 1 mlt, sym = only(mult_syms) if !byterepresentable(mlt) - # this is so we don't unnecessarilly add a separate offset + # this is so we don't unnecessarily add a separate offset muladd_index!( ls, opparents, diff --git a/test/gemm.jl b/test/gemm.jl index dc0fe5b34..7ef536edf 100644 --- a/test/gemm.jl +++ b/test/gemm.jl @@ -804,7 +804,7 @@ dense!(LoopVectorization.relu, C, A2, B) @test C ≈ LoopVectorization.relu.(@view(A2[:, begin:end-1]) * B .+ @view(A2[:, end])) - @testset "avx $T dynamc gemm" begin + @testset "avx $T dynamic gemm" begin AmulB!(C2, A, B) AmulBavx1!(C, A, B) @test C ≈ C2 diff --git a/test/steprange.jl b/test/steprange.jl index 19246f67a..3b2b2dd4e 100644 --- a/test/steprange.jl +++ b/test/steprange.jl @@ -1,7 +1,7 @@ -# Auxillary functions +# Auxiliary functions const _uint_bit_length = sizeof(UInt) * 8 const _div_uint_size_shift = Int(log2(_uint_bit_length)) @inline _mul2(i::Integer) = i << 1