Fix several typos #482

Merged
merged 1 commit into from Apr 2, 2023
2 changes: 1 addition & 1 deletion docs/src/devdocs/loopset_structure.md
@@ -56,7 +56,7 @@ References to arrays are represented with an `ArrayReferenceMeta` data structure
julia> LoopVectorization.operations(lsAmulB)[3].ref
LoopVectorization.ArrayReferenceMeta(LoopVectorization.ArrayReference(:A, [:m, :k], Int8[0, 0]), Bool[1, 1], Symbol("##vptr##_A"))
```
-It contains the name of the parent array (`:A`), the indicies `[:m,:k]`, and a boolean vector (`Bool[1, 1]`) indicating whether these indices are loop iterables. Note that the optimizer assumes arrays are column-major, and thus that it is efficient to read contiguous elements from the first index. In lower level terms, it means that [high-throughput vmov](https://www.felixcloutier.com/x86/movupd) instructions can be used rather than [low-throughput](https://www.felixcloutier.com/x86/vgatherdpd:vgatherqpd) [gathers](https://www.felixcloutier.com/x86/vgatherqps:vgatherqpd). Similar story for storing elements.
+It contains the name of the parent array (`:A`), the indices `[:m,:k]`, and a boolean vector (`Bool[1, 1]`) indicating whether these indices are loop iterables. Note that the optimizer assumes arrays are column-major, and thus that it is efficient to read contiguous elements from the first index. In lower level terms, it means that [high-throughput vmov](https://www.felixcloutier.com/x86/movupd) instructions can be used rather than [low-throughput](https://www.felixcloutier.com/x86/vgatherdpd:vgatherqpd) [gathers](https://www.felixcloutier.com/x86/vgatherqps:vgatherqpd). Similar story for storing elements.
When no axis has unit stride, the first given index will be the dummy `Symbol("##DISCONTIGUOUSSUBARRAY##")`.

!!! warning
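The column-major assumption discussed in this hunk can be illustrated with plain Julia (this is an illustrative sketch of memory layout, not LoopVectorization internals):

```julia
# In a column-major array, A[m, k] and A[m+1, k] are adjacent in memory,
# so iterating over the *first* index walks contiguous addresses --
# exactly the access pattern that permits contiguous vector loads.
A = reshape(collect(1:12), 3, 4)        # 3×4 matrix; storage order is 1,2,3,4,...

col_walk = [A[m, 2] for m in 1:3]       # walk down a column (first index varies)
@assert col_walk == vec(A)[4:6]         # a contiguous slice of the underlying storage

row_walk = [A[2, k] for k in 1:4]       # walk along a row (second index varies)
@assert row_walk == vec(A)[2:3:11]      # strided: would need a gather if vectorized
```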
2 changes: 1 addition & 1 deletion docs/src/examples/array_interface.md
@@ -9,7 +9,7 @@ LoopVectorization uses [ArrayInterface.jl](https://github.com/SciML/ArrayInterfa
that wasn't optimized by `LoopVectorization`, but instead simply had `@inbounds @fastmath` applied to the loop. This can often still yield reasonable to good performance, saving you from having to write more than one version of the loop
to get good performance and correct behavior just because the array types happen to be different.

-By supporting the interface, using `LoopVectorization` can simplify implementing many operations like matrix multiply while still getting good performance. For example, instead of [a few hundred lines of code](https://github.com/JuliaArrays/StaticArrays.jl/blob/0e431022954f0207eeb2c4f661b9f76936105c8a/src/matrix_multiply.jl#L4) to define matix multiplication in `StaticArrays`, one could simply write:
+By supporting the interface, using `LoopVectorization` can simplify implementing many operations like matrix multiply while still getting good performance. For example, instead of [a few hundred lines of code](https://github.com/JuliaArrays/StaticArrays.jl/blob/0e431022954f0207eeb2c4f661b9f76936105c8a/src/matrix_multiply.jl#L4) to define matrix multiplication in `StaticArrays`, one could simply write:
```julia
using StaticArrays, LoopVectorization

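The docs example referenced above is truncated by the diff view. A minimal sketch of the kind of `@turbo` matmul loop those docs describe (hypothetical function name `mymul!`; not the exact snippet from the docs):

```julia
using LoopVectorization

# A plain triple loop; `@turbo` vectorizes and unrolls it,
# replacing hand-written SIMD kernels for any array type that
# supports the ArrayInterface.jl stridedpointer interface.
function mymul!(C, A, B)
    @turbo for n in axes(C, 2), m in axes(C, 1)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end

A = rand(4, 5); B = rand(5, 6); C = zeros(4, 6)
mymul!(C, A, B)
```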
4 changes: 2 additions & 2 deletions docs/src/examples/dot_product.md
@@ -25,7 +25,7 @@ Thus, in 4 clock cycles, we can do up to 8 loads. But each `fma` requires 2 load

Double precision benchmarks pitting Julia's builtin dot product, and code compiled with a variety of compilers:
![dot](https://raw.githubusercontent.com/JuliaSIMD/LoopVectorization.jl/docsassets/docs/src/assets/bench_dot_v2.svg)
-What we just described is the core of the approach used by all these compilers. The variation in results is explained mostly by how they handle vectors with lengths that are not an integer multiple of `W`. I ran these on a computer with AVX512 so that `W = 8`. LLVM, the backend compiler of both Julia and Clang, shows rapid performance degredation as `N % 4W` increases, where `N` is the length of the vectors.
+What we just described is the core of the approach used by all these compilers. The variation in results is explained mostly by how they handle vectors with lengths that are not an integer multiple of `W`. I ran these on a computer with AVX512 so that `W = 8`. LLVM, the backend compiler of both Julia and Clang, shows rapid performance degradation as `N % 4W` increases, where `N` is the length of the vectors.
This is because, to handle the remainder, it uses a scalar loop that runs as written: multiply and add single elements, one after the other.

Initially, GCC (gfortran) stumbled in throughput, because it does not use separate accumulation vectors by default except on Power, even with `-funroll-loops`.
@@ -36,7 +36,7 @@ LoopVectorization uses `if/ifelse` checks to determine how many extra vectors ar

Neither GCC nor LLVM use masks (without LoopVectorization's assitance).

-I am not certain, but I believe Intel and GCC check for the vector's alignment, and align them if neccessary. Julia guarantees that the start of arrays beyond a certain size are aligned, so this is not an optimization I have implemented. But it may be worthwhile for handling large matrices with a number of rows that isn't an integer multiple of `W`. For such matrices, the first column may be aligned, but the next will not be.
+I am not certain, but I believe Intel and GCC check for the vector's alignment, and align them if necessary. Julia guarantees that the start of arrays beyond a certain size are aligned, so this is not an optimization I have implemented. But it may be worthwhile for handling large matrices with a number of rows that isn't an integer multiple of `W`. For such matrices, the first column may be aligned, but the next will not be.

## Dot-Self

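The "separate accumulation vectors" and "scalar remainder loop" ideas from this section can be sketched in scalar Julia (an illustration of the dependency-breaking pattern, not the SIMD code any of these compilers emit):

```julia
# Four independent partial sums break the add-latency dependency chain,
# mirroring what a compiler does with four SIMD accumulator registers.
function dot4(a, b)
    s1 = s2 = s3 = s4 = zero(eltype(a))
    N = length(a)
    i = 1
    while i + 3 <= N            # main loop: 4 independent accumulators
        s1 += a[i]   * b[i]
        s2 += a[i+1] * b[i+1]
        s3 += a[i+2] * b[i+2]
        s4 += a[i+3] * b[i+3]
        i += 4
    end
    s = s1 + s2 + s3 + s4
    while i <= N                # scalar remainder loop, as LLVM emits
        s += a[i] * b[i]
        i += 1
    end
    return s
end
```

The remainder loop is why performance degrades as `N % 4W` grows: those trailing elements are processed one at a time.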
2 changes: 1 addition & 1 deletion docs/src/examples/matrix_multiplication.md
@@ -22,7 +22,7 @@ Letting all three matrices be square and `Size` x `Size`, we attain the followin
This is classic GEMM, `𝐂 = 𝐀 * 𝐁`. GFortran's intrinsic `matmul` function does fairly well. But all the compilers are well behind LoopVectorization here, which falls behind MKL's `gemm` beyond 70x70 or so. The problem imposed by alignment is also striking: performance is much higher when the sizes are integer multiplies of 8. Padding arrays so that each column is aligned regardless of the number of rows can thus be very profitable. [PaddedMatrices.jl](https://github.com/JuliaSIMD/PaddedMatrices.jl) offers just such arrays in Julia. I believe that is also what the [-pad](https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-pad-qpad) compiler flag does when using Intel's compilers.

![AmulBt](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/bench_AmulBt_v2.svg)
-The optimal pattern for `𝐂 = 𝐀 * 𝐁ᵀ` is almost identical to that for `𝐂 = 𝐀 * 𝐁`. Yet, gfortran's `matmul` instrinsic stumbles, surprisingly doing much worse than gfortran + loops, and almost certainly worse than allocating memory for `𝐁ᵀ` and creating the ecplicit copy.
+The optimal pattern for `𝐂 = 𝐀 * 𝐁ᵀ` is almost identical to that for `𝐂 = 𝐀 * 𝐁`. Yet, gfortran's `matmul` intrinsic stumbles, surprisingly doing much worse than gfortran + loops, and almost certainly worse than allocating memory for `𝐁ᵀ` and creating the explicit copy.

ifort did equally well whethor or not `𝐁` was transposed, while LoopVectorization's performance degraded slightly faster as a function of size in the transposed case, because strides between memory accesses are larger when `𝐁` is transposed. But it still performed best of all the compiled loops over this size range, losing out to MKL and eventually OpenBLAS.
icc interestingly does better when it is transposed.
4 changes: 2 additions & 2 deletions docs/src/examples/multithreading.md
@@ -188,7 +188,7 @@ end
```
![complexdot3](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/threadedcomplexdot3product.svg)

-When testing on my laptop, the `C` implentation ultimately won, but I will need to investigate further to tell whether this benchmark benefits from hyperthreading,
+When testing on my laptop, the `C` implementation ultimately won, but I will need to investigate further to tell whether this benchmark benefits from hyperthreading,
or if it's because LoopVectorization's memory access patterns are less friendly.
I plan to work on cache-level blocking to increase memory friendliness eventually, and will likely also allow it to take advantage of hyperthreading/simultaneous multithreading, although I'd prefer a few motivating test problems to look at first. Note that a single core of this CPU is capable of exceeding 100 GFLOPS of double precision compute. The execution units are spending most of their time idle. So the question of whether hypthreading helps may be one of whether or not we are memory-limited.

@@ -218,7 +218,7 @@ julia> doubles_per_l2 = (2 ^ 20) ÷ 8
julia> total_doubles_in_l2 = doubles_per_l2 * (Sys.CPU_THREADS ÷ 2) # doubles_per_l2 * 18
2359296

-julia> doubles_per_mat = total_doubles_in_l2 ÷ 3 # divide up amoung 3 matrices
+julia> doubles_per_mat = total_doubles_in_l2 ÷ 3 # divide up among 3 matrices
786432

julia> sqrt(ans)
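The REPL session in this hunk is a back-of-the-envelope cache-blocking calculation; it can be reproduced as a plain script (assuming, as the session's comment implies, a 1 MiB L2 per core and 18 physical cores, i.e. `Sys.CPU_THREADS ÷ 2` on the benchmark machine):

```julia
# How large can square blocks of A, B, and C be while all three
# fit in the combined L2 caches?
l2_bytes        = 2^20                            # assumed: 1 MiB of L2 per core
doubles_per_l2  = l2_bytes ÷ 8                    # 131072 Float64s per L2
physical_cores  = 18                              # assumed: Sys.CPU_THREADS ÷ 2
total_doubles   = doubles_per_l2 * physical_cores # doubles across all L2s
doubles_per_mat = total_doubles ÷ 3               # split among the 3 matrices
side = sqrt(doubles_per_mat)                      # ≈ 886.8 → square blocks ~886×886
```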
4 changes: 2 additions & 2 deletions docs/src/examples/special_functions.md
@@ -16,12 +16,12 @@ end
While Intel's proprietary compilers do the best, LoopVectorization performs very well among open source alternatives. A complicating
factor to the above benchmark is that in accessing the diagonals, we are not accessing contiguous elements. A benchmark
simply exponentiating a vector shows that `gcc` also has efficient special function vectorization, but that the autovectorizer
-disagrees with the discontiguous memory acesses:
+disagrees with the discontiguous memory accesses:

![selfdot](https://github.com/JuliaSIMD/LoopVectorization.jl/raw/docsassets/docs/src/assets/bench_exp_v2.svg)

The similar performance between `gfortran` and `LoopVectorization` at multiples of 8 is no fluke: on Linux systems with a recent GLIBC, SLEEFPirates.jl --
which LoopVectorization depends on to vectorize these special functions -- looks for the GNU vector library and uses these functions
if available. Otherwise, it will use native Julia implementations that tend to be slower. As the modulus of vector length and vector width (8, on the
-host system thanks to AVX512) increases, `gfortran` shows the performance degredation pattern typical of LLVM-vectorized code.
+host system thanks to AVX512) increases, `gfortran` shows the performance degradation pattern typical of LLVM-vectorized code.

2 changes: 1 addition & 1 deletion src/codegen/lower_compute.jl
@@ -529,7 +529,7 @@ function lower_compute!(
parents_op = parents(op)
nparents = length(parents_op)
# __u₂max = ls.unrollspecification.u₂
-# TODO: perhaps allos for swithcing unrolled axis again
+# TODO: perhaps allow for switching unrolled axis again
mvar, u₁unrolledsym, u₂unrolledsym =
variable_name_and_unrolled(op, u₁loopsym, u₂loopsym, vloopsym, suffix, ls)
opunrolled = u₁unrolledsym || isu₁unrolled(op)
6 changes: 3 additions & 3 deletions src/codegen/lower_load.jl
@@ -157,15 +157,15 @@ function pushbroadcast!(q::Expr, mvar::Symbol)
)
end

-function child_cost_untill_vectorized(op::Operation)
+function child_cost_until_vectorized(op::Operation)
isvectorized(op) && return 0.0
c = 0.0
for child ∈ children(op)
if (!isvectorized(child) & iscompute(child))
# FIXME: can double count
c +=
instruction_cost(instruction(child)).scalar_reciprocal_throughput +
-child_cost_untill_vectorized(child)
+child_cost_until_vectorized(child)
end
end
c
@@ -174,7 +174,7 @@ function vectorization_profitable(op::Operation)
# if op is vectorized itself, return true
isvectorized(op) && return true
# otherwise, check if descendents until hitting a vectorized portion are expensive enough
-child_cost_untill_vectorized(op) ≥ 5
+child_cost_until_vectorized(op) ≥ 5
end

function lower_load_no_optranslation!(
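The renamed helper walks an operation's children, summing scalar costs until it reaches vectorized nodes. The same pattern in miniature (a hypothetical `Node` type standing in for LoopVectorization's `Operation`; not the package's actual code):

```julia
# Recurse through children, accumulating per-node cost, and stop
# descending at "vectorized" nodes, which contribute zero.
struct Node
    vectorized::Bool
    cost::Float64
    children::Vector{Node}
end

function cost_until_vectorized(n::Node)
    n.vectorized && return 0.0
    c = 0.0
    for child in n.children
        if !child.vectorized
            c += child.cost + cost_until_vectorized(child)
        end
    end
    return c
end

leafv = Node(true, 1.0, Node[])    # vectorized leaf: contributes nothing
mid   = Node(false, 2.0, [leafv])  # scalar node: contributes its own cost
root  = Node(false, 0.0, [mid])
```

`vectorization_profitable` then compares this accumulated scalar cost against a threshold (`≥ 5`) to decide whether vectorizing the load pays off.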
4 changes: 2 additions & 2 deletions src/codegen/lower_store.jl
@@ -377,7 +377,7 @@ function lower_tiled_store!(
inds_calc_by_ptr_offset = indices_calculated_by_pointer_offsets(ls, op.ref)

if donot_tile_store(ls, op, reductfunc, u₂)
-# If we have a reductfunc, we're using a reducing store instead of a contiuguous or shuffle store anyway
+# If we have a reductfunc, we're using a reducing store instead of a contiguous or shuffle store anyway
# so no benefit to being able to handle that case here, vs just calling the default `lower_store!` method
@unpack u₁, u₂max = ua
for t ∈ 0:u₂-1
@@ -408,7 +408,7 @@
u = Core.ifelse(isu₁, u₁, 1)
tup = Expr(:tuple)
for t ∈ 0:u₂-1
-# tiled stores cannot be loop values, as they're necessarilly
+# tiled stores cannot be loop values, as they're necessarily
# functions of at least two loops, meaning we do not need to handle them here.
push!(tup.args, Symbol(variable_name(opp, ifelse(isu₂, t, -1)), '_', u))
end
6 changes: 3 additions & 3 deletions src/codegen/lower_threads.jl
@@ -977,10 +977,10 @@ function avx_threads_expr(
LPSYM::Expr
)
valid_thread_loop, ua, c = valid_thread_loops(ls)
-num_candiates = sum(valid_thread_loop)
-if (num_candiates == 0) || (nt ≤ 1) # it was called from `avx_body` but now `nt` was set to `1`
+num_candidates = sum(valid_thread_loop)
+if (num_candidates == 0) || (nt ≤ 1) # it was called from `avx_body` but now `nt` was set to `1`
avx_body(ls, UNROLL)
-elseif (num_candiates == 1) || (nt ≤ 3)
+elseif (num_candidates == 1) || (nt ≤ 3)
thread_one_loops_expr(
ls,
ua,
2 changes: 1 addition & 1 deletion src/codegen/lowering.jl
@@ -1234,7 +1234,7 @@ function calc_Ureduct!(ls::LoopSet, us::UnrollSpecification)
elseif !((u₁ui == Int(u₁u)) & (u₂ui == Int(u₁u)))
throw(
ArgumentError(
-"Doesn't currenly handle differently unrolled reductions yet, please file an issue with an example."
+"Doesn't currently handle differently unrolled reductions yet, please file an issue with an example."
)
)
end
2 changes: 1 addition & 1 deletion src/condense_loopset.jl
@@ -452,7 +452,7 @@ function should_zerorangestart(
allsame = true
# The idea here is that if any ref to the same array doesn't have `ind`,
# we can't offset that dimension because different inds will clash.
-# Because offseting the array means counter-offseting the range, we need
+# Because offsetting the array means counter-offsetting the range, we need
# to be consistent, and check that all arrays are valid first.
for j ∈ @view(namev[2:end])
ref = allarrayrefs[j]
2 changes: 1 addition & 1 deletion src/constructors.jl
@@ -407,7 +407,7 @@ end
@tturbo

Equivalent to `@turbo`, except it adds `thread=true` as the first keyword argument.
-Note that later arguments take precendence.
+Note that later arguments take precedence.

Meant for convenience, as `@tturbo` is shorter than `@turbo thread=true`.
"""
4 changes: 2 additions & 2 deletions src/modeling/determinestrategy.jl
@@ -201,7 +201,7 @@ function evaluate_cost_unroll(
# Need to check if fusion is possible
for itersym ∈ order
cacheunrolled!(ls, itersym, Symbol(""), vloopsym)
-# Add to set of defined symbles
+# Add to set of defined symbols
push!(nested_loop_syms, itersym)
looplength = length(ls, itersym)
liter = itersym === vloopsym ? num_iterations(looplength, W) : looplength
@@ -1257,7 +1257,7 @@ function evaluate_cost_tile!(
elseif itersym == u₂loopsym
u₂reached = true
end
-# Add to set of defined symbles
+# Add to set of defined symbols
push!(nested_loop_syms, itersym)
looplength = length(ls, itersym)
iter *= itersym === vloopsym ? num_iterations(looplength, W) : looplength
4 changes: 2 additions & 2 deletions src/modeling/graphs.jl
@@ -1805,7 +1805,7 @@ function push_op!(
add_andblock!(ls, ex, elementbytes, position)
elseif ex.head === :||
add_orblock!(ls, ex, elementbytes, position)
-elseif ex.head === :local # Handle locals introduced by `@inbounds`; using `local` with `@turbo` is not recomended (nor is `@inbounds`; which applies automatically regardless)
+elseif ex.head === :local # Handle locals introduced by `@inbounds`; using `local` with `@turbo` is not recommended (nor is `@inbounds`; which applies automatically regardless)
@assert length(ex.args) == 1 # TODO replace assert + first with "only" once support for Julia < 1.4 is dropped
localbody = first(ex.args)
@assert localbody.head === :(=)
@@ -2117,7 +2117,7 @@ end

"""
Returns `0` if the op is the declaration of the constant outerreduction variable.
-Returns `n`, where `n` is the constant declarations's index among parents(op), if op is an outter reduction.
+Returns `n`, where `n` is the constant declarations's index among parents(op), if op is an outer reduction.
Returns `-1` if not an outerreduction.
"""
function isouterreduction(ls::LoopSet, op::Operation)
4 changes: 2 additions & 2 deletions src/modeling/operations.jl
@@ -163,7 +163,7 @@ loopvalue
"""
Operation

-A structure to encode a particular action occuring inside an `@turbo` block.
+A structure to encode a particular action occurring inside an `@turbo` block.

# Fields

@@ -196,7 +196,7 @@ Each one of these lines is a pretty-printed `Operation`.
"""
mutable struct Operation <: AbstractLoopOperation
"""A unique identifier for this operation.
-`identifer(op::Operation)` returns the index of this operation within `operations(ls::LoopSet)`."""
+`identifier(op::Operation)` returns the index of this operation within `operations(ls::LoopSet)`."""
identifier::Int
"""The name of the variable storing the result of this operation.
For `a = val` this would be `:a`. For array assignments `A[i,j] = val` this would be `:A`."""
2 changes: 1 addition & 1 deletion src/parse/memory_ops_common.jl
@@ -585,7 +585,7 @@ function checkforoffset!(
if length(mult_syms) == 1
mlt, sym = only(mult_syms)
if !byterepresentable(mlt)
-# this is so we don't unnecessarilly add a separate offset
+# this is so we don't unnecessarily add a separate offset
muladd_index!(
ls,
opparents,
2 changes: 1 addition & 1 deletion test/gemm.jl
@@ -804,7 +804,7 @@
dense!(LoopVectorization.relu, C, A2, B)
@test C ≈
LoopVectorization.relu.(@view(A2[:, begin:end-1]) * B .+ @view(A2[:, end]))
-@testset "avx $T dynamc gemm" begin
+@testset "avx $T dynamic gemm" begin
AmulB!(C2, A, B)
AmulBavx1!(C, A, B)
@test C ≈ C2
2 changes: 1 addition & 1 deletion test/steprange.jl
@@ -1,7 +1,7 @@



-# Auxillary functions
+# Auxiliary functions
const _uint_bit_length = sizeof(UInt) * 8
const _div_uint_size_shift = Int(log2(_uint_bit_length))
@inline _mul2(i::Integer) = i << 1