fusedL2NN: Fix updateReDucedVal with >2 rows/warp
In updateReDucedVal, a single warp can hold multiple rows (in registers). A single thread within the warp uses the first element of each row to atomically update an output array.

In the previous implementation, a shuffle moved the head of each row into the first thread of the warp. Unfortunately, each shuffle also overwrote the values of all other rows. This strategy still produced correct results when the number of rows per warp equalled 2, because only a single shuffle is performed in that case, so the bug never triggered. A recent commit increased the number of rows per warp to four in certain situations (skinny matrices), which exposed the bug.

In the new implementation, the values are no longer shuffled into the first thread of the warp. Instead, the threads that hold the first element of a row update the output in sequential order. The sequential ordering is necessary to avoid deadlock on the Pascal architecture.
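The failure mode described above can be illustrated with a small host-side C++ model. This is only a sketch, not the kernel code: the names make_warp, shuffle_from, update_via_shuffle and update_in_order are hypothetical, the warp is modeled as an array of 32 per-lane registers, and it assumes that the old code broadcast the next row's head lane to every lane after each update. Atomics and the Pascal deadlock issue are not modeled.

```cpp
#include <array>
#include <vector>

constexpr int kWarpSize = 32;
using WarpRegs = std::array<int, kWarpSize>;  // one register value per lane

// Hypothetical layout: each row occupies a contiguous group of lanes,
// and every lane of row r holds the value 100 * (r + 1).
WarpRegs make_warp(int rows_per_warp) {
  const int lanes_per_row = kWarpSize / rows_per_warp;
  WarpRegs regs{};
  for (int lane = 0; lane < kWarpSize; ++lane) {
    regs[lane] = 100 * (lane / lanes_per_row + 1);
  }
  return regs;
}

// Model of a warp-wide shuffle: every lane receives src_lane's current value.
WarpRegs shuffle_from(const WarpRegs& regs, int src_lane) {
  WarpRegs out;
  out.fill(regs[src_lane]);
  return out;
}

// Assumed old strategy: lane 0 writes each row, then all lanes are
// overwritten with the value held by the head lane of the next row.
std::vector<int> update_via_shuffle(WarpRegs regs, int rows_per_warp) {
  const int lanes_per_row = kWarpSize / rows_per_warp;
  std::vector<int> out(rows_per_warp);
  for (int row = 0; row < rows_per_warp; ++row) {
    out[row] = regs[0];  // only the first lane updates the output
    if (row + 1 < rows_per_warp) {
      // This also clobbers the head lanes of rows that are still pending.
      regs = shuffle_from(regs, (row + 1) * lanes_per_row);
    }
  }
  return out;
}

// New strategy: each row's head lane writes its own value, one row at a time.
std::vector<int> update_in_order(const WarpRegs& regs, int rows_per_warp) {
  const int lanes_per_row = kWarpSize / rows_per_warp;
  std::vector<int> out(rows_per_warp);
  for (int row = 0; row < rows_per_warp; ++row) {
    for (int lane = 0; lane < kWarpSize; ++lane) {
      if (lane == row * lanes_per_row) out[row] = regs[lane];
    }
  }
  return out;
}
```

In this model, update_via_shuffle(make_warp(2), 2) returns the correct {100, 200}, while update_via_shuffle(make_warp(4), 4) returns {100, 200, 200, 200}: the second shuffle reads a head lane that the first shuffle already overwrote. update_in_order(make_warp(4), 4) returns {100, 200, 300, 400}.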