Flops look correct, but the bandwidth numbers for the 5-d local kernels look way off. A quick look at the code suggests that these aren't special-cased and the default `DslashCuda::bytes()` is being used, but I'll need to check for more than 30 seconds to work this out exactly.
```
Tuned block=(128,3,1), shared=24577 giving 71.00 Gflop/s, 1116.09 GB/s for N4quda16MDWFDslashPCCudaI7double2S1_EE with type=single-GPU,reconstruct=18,Dslash4pre
Tuned block=(32,4,1), shared=0 giving 170.86 Gflop/s, 347.94 GB/s for N4quda16MDWFDslashPCCudaI7double2S1_EE with type=single-GPU,reconstruct=18,Dslash4
Tuned block=(32,4,1), shared=16385 giving 155.89 Gflop/s, 181.87 GB/s for N4quda16MDWFDslashPCCudaI7double2S1_EE with type=single-GPU,reconstruct=18,Dslash5inv
Tuned block=(32,2,1), shared=24577 giving 64.99 Gflop/s, 895.91 GB/s for N4quda16MDWFDslashPCCudaI7double2S1_EE with type=single-GPU,reconstruct=18,Xpay,Dslash5
Executing 10 kernel loops...
```