
Tuned bandwidth numbers for some 5-d operators look wrong #315

Closed
maddyscientist opened this issue Jul 14, 2015 · 2 comments

@maddyscientist (Member) commented:

Flops look correct, but the bandwidth numbers for the 5-d local kernels look way off (1116 GB/s, for instance, is well beyond the peak memory bandwidth of any GPU of that era). A quick look at the code suggests that these kernels aren't special-cased and the default DslashCuda::bytes() is being used, but I'll need to look for more than 30 seconds to work this out exactly.

Tuned block=(128,3,1), shared=24577 giving 71.00 Gflop/s, 1116.09 GB/s for N4quda16MDWFDslashPCCudaI7double2S1_EE with type=single-GPU,reconstruct=18,Dslash4pre
Tuned block=(32,4,1), shared=0 giving 170.86 Gflop/s, 347.94 GB/s for N4quda16MDWFDslashPCCudaI7double2S1_EE with type=single-GPU,reconstruct=18,Dslash4
Tuned block=(32,4,1), shared=16385 giving 155.89 Gflop/s, 181.87 GB/s for N4quda16MDWFDslashPCCudaI7double2S1_EE with type=single-GPU,reconstruct=18,Dslash5inv
Tuned block=(32,2,1), shared=24577 giving 64.99 Gflop/s, 895.91 GB/s for N4quda16MDWFDslashPCCudaI7double2S1_EE with type=single-GPU,reconstruct=18,Xpay,Dslash5
Executing 10 kernel loops...
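
For illustration, here is a minimal, self-contained sketch (hypothetical class and member names, not QUDA's actual DslashCuda hierarchy or traffic model) of how an un-overridden default byte model can inflate the reported bandwidth for a fifth-dimension kernel, and what a special-cased bytes() override would change:

```cpp
// Minimal sketch, assuming hypothetical names: shows how a tuner's
// default byte model overstates the bandwidth of a kernel that moves
// less data than the model assumes.
#include <cstdio>

struct DslashTuneBase {
  long long sites;  // number of lattice sites the kernel processes
  int prec;         // bytes per real number (8 = double)
  // Default model: full 4-d Wilson stencil -- 8 gauge links
  // (18 reals each at reconstruct=18) plus 8 neighbor spinors
  // (24 reals each) read, one spinor written.
  virtual long long bytes() const {
    return sites * (8LL * 18 + 8LL * 24 + 24) * prec;
  }
  virtual ~DslashTuneBase() = default;
};

struct Dslash5Tune : DslashTuneBase {
  // A fifth-dimension-only operator reads no gauge field: per site
  // it touches only a few spinors along s.  Without this override,
  // the default model above is silently used and the reported GB/s
  // is inflated.
  long long bytes() const override {
    return sites * (3LL * 24 + 24) * prec;  // e.g. 3 reads + 1 write
  }
};

int main() {
  Dslash5Tune k;
  k.sites = 16 * 16 * 16 * 16 * 16;  // made-up 16^4 x Ls=16 volume
  k.prec = 8;                        // double precision
  const double time_s = 1.0e-3;      // made-up kernel time

  // Reported bandwidth = modeled bytes / measured time.
  std::printf("default (wrong) model: %7.1f GB/s\n",
              k.DslashTuneBase::bytes() / time_s / 1e9);
  std::printf("special-cased model:   %7.1f GB/s\n",
              k.bytes() / time_s / 1e9);
}
```

With these made-up numbers, the un-overridden model reports several times the bandwidth of the special-cased one for the same kernel time, which is consistent with the implausible GB/s figures in the log above.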
maddyscientist added this to the QUDA 0.7.2 milestone Jul 14, 2015
@mathiaswagner (Member) commented:

What volume was this? Did you check limiting cases?

@mathiaswagner (Member) commented:

Closed with #317.
