Omnibus PR - Oct 2023 (#678)
Details:
- This is an "omnibus" commit, consisting of multiple medium-sized
  commits that affect non-trivial aspects of BLIS. The major highlights:
  - Relocated the pba, sba pool (from the rntm_t), and mem_t (from the
    cntl_t) to the thrinfo_t object. This allows the rntm_t to be
    effectively const (although it is sometimes copied internally and
    modified to reflect different ways of parallelism). Moving the mem_t
    sets the stage for sharing a global control tree amongst all
    threads.
  - De-templatized the macrokernels for gemmt, trmm, and trsm to match
    the macrokernel for gemm, which has been de-templatized since
    54fa28b.
  - Reimplemented bli_l3_determine_kc() by separating out the logic for
    adjusting KC based on MR/NR for triangular A and/or B into a new
    function, bli_l3_adjust_kc(). For now, this function is still called
    from bli_l3_determine_kc(), but in the future we plan to have it
    called once when constructing the control tree.
  - Refactored the level-3 thread decorator into two parts:
    - One part deals only with launching threads, each one calling a
      generic thread entry function. This code resides in frame/thread
      and constitutes the definition of bli_thread_launch(). Note that
      it is specific to the threading implementation (OpenMP, pthreads,
      single, etc.).
    - The other part deals with passing the matrix operands and related
      information into bli_thread_launch(). This is the "l3 decorator"
      and now resides in frame/3. It is agnostic to the threading
      implementation.
  - Modified the "level" of the thread control tree passed in at each
    operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was
    passed a communicator representing the active thread teams that
    would share the available work. Now, the *parent* thread comm is
    passed in. The operation then grabs the child comm and uses it to
    partition the work. The difference shows up in bli_trsm_blk_var1(),
    where there are now two child nodes for this single operation (i.e.
    the thread control tree is split one level above where the control
    tree is). The sub-prenode is used for the trsm subproblem while the
    normal sub-node is used for the gemm part. Importantly, the parent
    comm is used for the barrier between them.
- Removed cntl_t* arguments from bli_*_front() functions. These will be
  added back in the future when the control tree's creation is moved so
  that it happens much sooner (provided that bli_*_front() have not been
  absorbed into their respective bli_*_ex() functions).
- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),
  for consistency. This includes _num_threads(), _thread_id(), _n_way(),
  _work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and
  _am_chief().
- Removed extraneous barrier from _blk_var3() of gemm and trsm.
- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was
  misspelled.
- (cherry picked from commit aeb5f0c)
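
The two-part split of the level-3 thread decorator described above can be
pictured with a small standalone sketch. Everything here is hypothetical:
the toy_* names, the entry-function signature, and the "scale a vector"
operation are illustrative stand-ins, not the BLIS API. The launcher
imitates a "single" backend by calling the entry function once per thread
id; only that function would change for an OpenMP or pthreads backend.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical generic thread entry signature: thread id, team size,
   and an opaque parameter block. */
typedef void (*toy_thread_func_t)( size_t tid, size_t n_threads, void* params );

/* Part 1: the launcher. This toy version runs the entry function once
   per thread id, sequentially, like a single-threaded backend would. */
static void toy_thread_launch( size_t n_threads, toy_thread_func_t func,
                               void* params )
{
    assert( n_threads > 0 && func != NULL );
    for ( size_t tid = 0; tid < n_threads; ++tid )
        func( tid, n_threads, params );
}

/* Operand bundle for a toy operation y := alpha * x. */
typedef struct
{
    double        alpha;
    const double* x;
    double*       y;
    size_t        n;
} toy_l3_params_t;

/* Entry function run by each thread: scale one contiguous slice. */
static void toy_l3_thread_entry( size_t tid, size_t n_threads, void* params )
{
    toy_l3_params_t* p = params;
    size_t start = ( p->n * tid ) / n_threads;
    size_t end   = ( p->n * ( tid + 1 ) ) / n_threads;
    for ( size_t i = start; i < end; ++i )
        p->y[ i ] = p->alpha * p->x[ i ];
}

/* Part 2: the decorator. It only bundles the operands and hands them to
   the launcher; it knows nothing about how threads are created. */
static void toy_l3_decorator( size_t n_threads, double alpha,
                              const double* x, double* y, size_t n )
{
    toy_l3_params_t params = { alpha, x, y, n };
    toy_thread_launch( n_threads, toy_l3_thread_entry, &params );
}
```

Calling toy_l3_decorator( 3, 2.0, x, y, 5 ) fills y with 2*x using three
slices, and swapping in a different launcher would not touch the decorator.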

Fixed performance bug caused by redundant packing. (#680)

Details:
- Fixed a performance bug whereby multiple threads were redundantly
  packing the same (rather than separate) micropanels. This bug was
  caused by different parts of the code using the num_threads/thread_id
  fields of the thrinfo_t vs. the n_way/work_id fields. The fix was to
  standardize on the latter and provide a "fake" thrinfo_t sub-prenode
  in the thrinfo tree which consists of single-member thread teams. The
  node containing a single team with multiple threads is still required,
  since only that node can be used to perform barriers and broadcasts
  (e.g. of the packed buffer pointer).
- (cherry picked from commit 29f79f0)
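
A toy model may help show why partitioning on the wrong pair of fields
leads to redundant packing. The struct below is a hypothetical stand-in
for a thrinfo_t node (only the four field names come from the text above),
and the assumption that the multi-threaded team node reports n_way = 1 and
work_id = 0 is ours, made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a thrinfo_t node. */
typedef struct
{
    size_t num_threads; /* threads in this node's team            */
    size_t thread_id;   /* this thread's id within the team       */
    size_t n_way;       /* number of ways the work is split       */
    size_t work_id;     /* which share of the work this team owns */
} toy_thrinfo_t;

/* Build the "fake" sub-node for one thread of the parent team: every
   thread becomes its own single-member team with a distinct work_id. */
static toy_thrinfo_t toy_make_single_member_subnode( const toy_thrinfo_t* parent )
{
    assert( parent->num_threads > 0 );
    toy_thrinfo_t sub;
    sub.num_threads = 1;
    sub.thread_id   = 0;
    sub.n_way       = parent->num_threads;
    sub.work_id     = parent->thread_id;
    return sub;
}

/* Count how often each micropanel is packed when every thread assigns
   itself panels round-robin using n_way/work_id of the chosen node. */
static void toy_count_packs( size_t team_size, size_t n_panels,
                             int use_subnode, unsigned* counts )
{
    for ( size_t p = 0; p < n_panels; ++p ) counts[ p ] = 0;

    for ( size_t t = 0; t < team_size; ++t )
    {
        /* Assumed shape of the multi-threaded team node. */
        toy_thrinfo_t team = { team_size, t, 1, 0 };
        toy_thrinfo_t node = use_subnode
                             ? toy_make_single_member_subnode( &team )
                             : team;

        for ( size_t p = node.work_id; p < n_panels; p += node.n_way )
            counts[ p ] += 1;
    }
}
```

With four threads and use_subnode = 0, every panel is counted four times;
with use_subnode = 1, every panel is counted exactly once. The team node
itself remains the only place where a barrier or broadcast makes sense,
since the sub-node teams have one member each.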

Fixed random segfault in test/3 drivers. (#788)

Details:
- Fixed a segfault in the non-gemm test drivers in test/3 that was the
  result of sometimes leaving either .n_str or .k_str fields of the
  params_t struct uninitialized, depending on the operation in question.
  For example, in test_hemm.c, init_def_params() would only initialize
  the .m_str and .n_str fields, but not the .k_str field. Even though
  hemm doesn't use a 'k' dimension, the proc_params() function (called
  via parse_cl_params()) universally attempts to convert all three into
  integers via sscanf(), which was understandably failing when one of
  those strings was a NULL pointer. I'm not sure how this code ever
  worked to begin with. Special thanks to Leick Robinson for finding and
  reporting this bug.
- (cherry picked from commit 1236dda)
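
The failure mode above (passing a NULL string to sscanf()) and the shape
of the fix can be sketched as follows. This is not the test/3 source: the
toy_* names and the default value "10" are hypothetical, and the NULL
guard in toy_parse_dim() is an extra defensive step of ours rather than
something the commit is stated to include.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical reduced version of the drivers' params_t. */
typedef struct
{
    const char* m_str;
    const char* n_str;
    const char* k_str;
} toy_params_t;

/* Give every dimension string a default, even for operations (such as
   hemm) that never use the k dimension. */
static void toy_init_def_params( toy_params_t* params )
{
    assert( params != NULL );
    params->m_str = "10";
    params->n_str = "10";
    params->k_str = "10";
}

/* Convert one dimension string. Returns 0 on success, -1 if the string
   is NULL or does not begin with an integer. */
static int toy_parse_dim( const char* str, long* out )
{
    if ( str == NULL ) return -1;
    return ( sscanf( str, "%ld", out ) == 1 ) ? 0 : -1;
}

/* Convert all three dimensions unconditionally, as the text describes
   proc_params() doing. */
static int toy_proc_params( const toy_params_t* params,
                            long* m, long* n, long* k )
{
    if ( toy_parse_dim( params->m_str, m ) != 0 ) return -1;
    if ( toy_parse_dim( params->n_str, n ) != 0 ) return -1;
    if ( toy_parse_dim( params->k_str, k ) != 0 ) return -1;
    return 0;
}
```

After toy_init_def_params(), toy_proc_params() succeeds for all three
fields; if k_str is later set to NULL, it reports an error instead of
invoking undefined behavior inside sscanf().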

Fixed staleness in kernels/zen/3/bli_gemm_small.c.

Details:
- Added missing 'const' keyword in function prototypes for
  bli_gemm_small() and friends.
- Updated pba usage to reflect new APIs.
- Fixed syntax typo in 'export GOMP_CPU_AFFINITY' line in ul2128
  conditional of test/3/runme.sh.
- Thanks to Jeff Diamond for reporting these issues.

Allow test/3 drivers to use default ind_t method. (#804)

Details:
- Previously, the standalone performance drivers in test/3 were written
  under the assumption that the user would want to explicitly test
  either native execution *or* 1m. But because the accompanying runme.sh
  script defaults to passing "native" in for the -i command line option
  (which explicitly sets the induced method type), running the script
  without modification causes the test drivers to use slow reference
  microkernels on systems where native complex-domain microkernels are
  not registered -- which will yield poor performance for complex-domain
  level-3 operations. Furthermore, even if a user was aware of this, the
  test drivers did not support any single value for the -i option that
  would test BLIS using the library's default behavior -- that is, using
  1m on systems where it is needed and native execution on systems that
  have native microkernels implemented and registered.
- This commit addresses the aforementioned issue by supporting a new
  value for the -i option: "auto". The "auto" value causes the driver
  to avoid explicitly setting the induced method altogether, leaving
  BLIS's default behavior in place. This "auto" option is also now the
  default setting within the runme.sh script. Thanks to Leick Robinson
  for finding and reporting this issue.
- Also added support for "nat" as a shorthand for "native", which
  the help text already (erroneously) claimed was supported.
- (cherry picked from commit fd1a7e3)
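
The -i handling described above amounts to a three-way decision, sketched
here in isolation. The enum, the function names, and the literal "1m" as
an accepted option string are assumptions for illustration; the sketch
deliberately calls no BLIS functions and only reports whether a driver
would set the induced method explicitly.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical classification of the -i option value. */
typedef enum
{
    TOY_IND_AUTO,    /* leave the library's default behavior in place */
    TOY_IND_NATIVE,  /* force native execution                        */
    TOY_IND_1M,      /* force the 1m induced method                   */
    TOY_IND_INVALID  /* unrecognized value                            */
} toy_ind_choice_t;

/* Map the option string to a choice. "nat" is accepted as shorthand
   for "native". */
static toy_ind_choice_t toy_parse_ind_option( const char* arg )
{
    assert( arg != NULL );
    if ( strcmp( arg, "auto" ) == 0 ) return TOY_IND_AUTO;
    if ( strcmp( arg, "native" ) == 0 || strcmp( arg, "nat" ) == 0 )
        return TOY_IND_NATIVE;
    if ( strcmp( arg, "1m" ) == 0 ) return TOY_IND_1M;
    return TOY_IND_INVALID;
}

/* Returns 1 if the driver should explicitly set an induced method and
   0 if it should leave the default alone (the "auto" case). */
static int toy_should_set_ind( toy_ind_choice_t choice )
{
    return choice == TOY_IND_NATIVE || choice == TOY_IND_1M;
}
```

Under this sketch, "auto" is the only valid value for which
toy_should_set_ind() returns 0, which matches the intent of letting the
library pick 1m or native execution on its own.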
fgvanzee committed Apr 25, 2024
1 parent 751d0a1 commit 4cf2a99
Showing 224 changed files with 5,263 additions and 11,150 deletions.
10 changes: 5 additions & 5 deletions addon/gemmd/attic/bao_gemmd_bp_var2.c
@@ -386,8 +386,8 @@ void PASTECH2(bao_,ch,varname) \
 /* Query the number of threads and thread ids for the JR loop.
 NOTE: These values are only needed when computing the next
 micropanel of B. */ \
-const dim_t jr_nt = bli_thread_n_way( thread_jr ); \
-const dim_t jr_tid = bli_thread_work_id( thread_jr ); \
+const dim_t jr_nt = bli_thrinfo_n_way( thread_jr ); \
+const dim_t jr_tid = bli_thrinfo_work_id( thread_jr ); \
 \
 /* Compute number of primary and leftover components of the JR loop. */ \
 dim_t jr_iter = ( nc_cur + NR - 1 ) / NR; \
@@ -416,8 +416,8 @@ void PASTECH2(bao_,ch,varname) \
 /* Query the number of threads and thread ids for the IR loop.
 NOTE: These values are only needed when computing the next
 micropanel of A. */ \
-const dim_t ir_nt = bli_thread_n_way( thread_ir ); \
-const dim_t ir_tid = bli_thread_work_id( thread_ir ); \
+const dim_t ir_nt = bli_thrinfo_n_way( thread_ir ); \
+const dim_t ir_tid = bli_thrinfo_work_id( thread_ir ); \
 \
 /* Compute number of primary and leftover components of the IR loop. */ \
 dim_t ir_iter = ( mc_cur + MR - 1 ) / MR; \
@@ -476,7 +476,7 @@ void PASTECH2(bao_,ch,varname) \
 /* This barrier is needed to prevent threads from starting to pack
 the next row panel of B before the current row panel is fully
 computed upon. */ \
-bli_thread_barrier( thread_pb ); \
+bli_thrinfo_barrier( thread_pb ); \
 } \
 } \
 \
10 changes: 5 additions & 5 deletions addon/gemmd/bao_gemmd_bp_var1.c
@@ -370,8 +370,8 @@ void PASTECH2(bao_,ch,varname) \
 /* Query the number of threads and thread ids for the JR loop.
 NOTE: These values are only needed when computing the next
 micropanel of B. */ \
-const dim_t jr_nt = bli_thread_n_way( thread_jr ); \
-const dim_t jr_tid = bli_thread_work_id( thread_jr ); \
+const dim_t jr_nt = bli_thrinfo_n_way( thread_jr ); \
+const dim_t jr_tid = bli_thrinfo_work_id( thread_jr ); \
 \
 /* Compute number of primary and leftover components of the JR loop. */ \
 dim_t jr_iter = ( nc_cur + NR - 1 ) / NR; \
@@ -400,8 +400,8 @@ void PASTECH2(bao_,ch,varname) \
 /* Query the number of threads and thread ids for the IR loop.
 NOTE: These values are only needed when computing the next
 micropanel of A. */ \
-const dim_t ir_nt = bli_thread_n_way( thread_ir ); \
-const dim_t ir_tid = bli_thread_work_id( thread_ir ); \
+const dim_t ir_nt = bli_thrinfo_n_way( thread_ir ); \
+const dim_t ir_tid = bli_thrinfo_work_id( thread_ir ); \
 \
 /* Compute number of primary and leftover components of the IR loop. */ \
 dim_t ir_iter = ( mc_cur + MR - 1 ) / MR; \
@@ -458,7 +458,7 @@ void PASTECH2(bao_,ch,varname) \
 /* This barrier is needed to prevent threads from starting to pack
 the next row panel of B before the current row panel is fully
 computed upon. */ \
-bli_thread_barrier( rntm, thread_pb ); \
+bli_thrinfo_barrier( thread_pb ); \
 } \
 } \
 \
10 changes: 5 additions & 5 deletions addon/gemmd/bao_l3_packm_a.c
@@ -61,7 +61,7 @@ void PASTECH2(bao_,ch,opname) \
 \
 /* Barrier to make sure all threads are caught up and ready to begin the
 packm stage. */ \
-bli_thread_barrier( rntm, thread ); \
+bli_thrinfo_barrier( thread ); \
 \
 /* Compute the size of the memory block eneded. */ \
 siz_t size_needed = sizeof( ctype ) * m_pack * k_pack; \
@@ -90,7 +90,7 @@ void PASTECH2(bao_,ch,opname) \
 \
 /* Broadcast the address of the chief thread's passed-in mem_t to all
 threads. */ \
-mem_t* mem_p = bli_thread_broadcast( rntm, thread, mem ); \
+mem_t* mem_p = bli_thrinfo_broadcast( thread, mem ); \
 \
 /* Non-chief threads: Copy the contents of the chief thread's
 passed-in mem_t to the passed-in mem_t for this thread. (The
@@ -139,7 +139,7 @@ void PASTECH2(bao_,ch,opname) \
 \
 /* Broadcast the address of the chief thread's passed-in mem_t
 to all threads. */ \
-mem_t* mem_p = bli_thread_broadcast( rntm, thread, mem ); \
+mem_t* mem_p = bli_thrinfo_broadcast( thread, mem ); \
 \
 /* Non-chief threads: Copy the contents of the chief thread's
 passed-in mem_t to the passed-in mem_t for this thread. (The
@@ -313,13 +313,13 @@ void PASTECH2(bao_,ch,opname) \
 d, incd, \
 a, rs_a, cs_a, \
 *p, *rs_p, *cs_p, \
-pd_p, *ps_p, \
+pd_p, *ps_p, \
 cntx, \
 thread \
 ); \
 \
 /* Barrier so that packing is done before computation. */ \
-bli_thread_barrier( rntm, thread ); \
+bli_thrinfo_barrier( thread ); \
 }

 //INSERT_GENTFUNC_BASIC0( packm_a )
10 changes: 5 additions & 5 deletions addon/gemmd/bao_l3_packm_b.c
@@ -61,7 +61,7 @@ void PASTECH2(bao_,ch,opname) \
 \
 /* Barrier to make sure all threads are caught up and ready to begin the
 packm stage. */ \
-bli_thread_barrier( rntm, thread ); \
+bli_thrinfo_barrier( thread ); \
 \
 /* Compute the size of the memory block eneded. */ \
 siz_t size_needed = sizeof( ctype ) * k_pack * n_pack; \
@@ -90,7 +90,7 @@ void PASTECH2(bao_,ch,opname) \
 \
 /* Broadcast the address of the chief thread's passed-in mem_t to all
 threads. */ \
-mem_t* mem_p = bli_thread_broadcast( rntm, thread, mem ); \
+mem_t* mem_p = bli_thrinfo_broadcast( thread, mem ); \
 \
 /* Non-chief threads: Copy the contents of the chief thread's
 passed-in mem_t to the passed-in mem_t for this thread. (The
@@ -139,7 +139,7 @@ void PASTECH2(bao_,ch,opname) \
 \
 /* Broadcast the address of the chief thread's passed-in mem_t
 to all threads. */ \
-mem_t* mem_p = bli_thread_broadcast( rntm, thread, mem ); \
+mem_t* mem_p = bli_thrinfo_broadcast( thread, mem ); \
 \
 /* Non-chief threads: Copy the contents of the chief thread's
 passed-in mem_t to the passed-in mem_t for this thread. (The
@@ -313,13 +313,13 @@ void PASTECH2(bao_,ch,opname) \
 d, incd, \
 b, rs_b, cs_b, \
 *p, *rs_p, *cs_p, \
-pd_p, *ps_p, \
+pd_p, *ps_p, \
 cntx, \
 thread \
 ); \
 \
 /* Barrier so that packing is done before computation. */ \
-bli_thread_barrier( rntm, thread ); \
+bli_thrinfo_barrier( thread ); \
 }

 //INSERT_GENTFUNC_BASIC0( packm_b )
4 changes: 2 additions & 2 deletions addon/gemmd/bao_l3_packm_var1.c
@@ -127,8 +127,8 @@ void PASTECH2(bao_,ch,varname) \
 \
 /* Query the number of threads and thread ids from the current thread's
 packm thrinfo_t node. */ \
-const dim_t nt = bli_thread_n_way( thread ); \
-const dim_t tid = bli_thread_work_id( thread ); \
+const dim_t nt = bli_thrinfo_n_way( thread ); \
+const dim_t tid = bli_thrinfo_work_id( thread ); \
 \
 /* Suppress warnings in case tid isn't used (ie: as in slab partitioning). */ \
 ( void )nt; \
4 changes: 2 additions & 2 deletions addon/gemmd/bao_l3_packm_var2.c
@@ -127,8 +127,8 @@ void PASTECH2(bao_,ch,varname) \
 \
 /* Query the number of threads and thread ids from the current thread's
 packm thrinfo_t node. */ \
-const dim_t nt = bli_thread_n_way( thread ); \
-const dim_t tid = bli_thread_work_id( thread ); \
+const dim_t nt = bli_thrinfo_n_way( thread ); \
+const dim_t tid = bli_thrinfo_work_id( thread ); \
 \
 /* Suppress warnings in case tid isn't used (ie: as in slab partitioning). */ \
 ( void )nt; \