`dslash_test` failing when `QUDA_REORDER_LOCATION=CPU` is set
In the current version of the `develop` branch, `dslash_test` fails when `QUDA_REORDER_LOCATION=CPU` is set.

Reproducible:
```
export QUDA_REORDER_LOCATION=CPU
dslash_test --xdim 4 --ydim 4 --zdim 4 --tdim 4
```
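For context on what the variable toggles (see also the WARNING line in the output below): `QUDA_REORDER_LOCATION` selects whether field data reordering is done on the host or on the device. A minimal sketch of how a driver might read it, using illustrative names rather than QUDA's internal API:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical enum for illustration only; QUDA's real handling lives inside the library.
enum class ReorderLocation { GPU, CPU };

ReorderLocation get_reorder_location()
{
  // Default is reordering on the device; setting the variable to "CPU" moves it to the host.
  const char *env = std::getenv("QUDA_REORDER_LOCATION");
  if (env != nullptr && std::strcmp(env, "CPU") == 0) return ReorderLocation::CPU;
  return ReorderLocation::GPU;
}

int main()
{
  const auto loc = get_reorder_location();
  std::printf("Data reordering done on %s\n", loc == ReorderLocation::CPU ? "CPU" : "GPU");
  return 0;
}
```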
Output:
```
Disabling GPU-Direct RDMA access
Enabling peer-to-peer copy engine and direct load/store access
Rank order is column major (t running fastest)
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from DslashTest
QUDA 1.1.0 (git 1.1.0-f8855bbec-sm_80)
CUDA Driver version = 12010
CUDA Runtime version = 11080
Graphic driver version = 530.30.02
Found device 0: NVIDIA A100-SXM-64GB
Using device 0: NVIDIA A100-SXM-64GB
WARNING: Data reordering done on CPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Environment variable QUDA_RESOURCE_PATH is not set.
WARNING: Caching of tuned parameters will be disabled.
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
cublasCreated successfully
[ RUN      ] DslashTest.benchmark
Randomizing fields...
Sending gauge field to GPU
Creating cudaSpinor with nParity = 1
Creating cudaSpinorOut with nParity = 1
Sending spinor field to GPU
Source: CPU = 2.032196e+03, CUDA = 2.032196e+03
running the following test:
prec    recon   dtest_type   matpc_type   dagger   S_dim      T_dimension   Ls_dimension   dslash_type   niter
single  18      Dslash       even_even    0        4/ 4/ 4    8             16             wilson        100
Grid partition info:   X  Y  Z  T
                       0  0  0  0
Tuning...
Executing 100 kernel loops...
done.

10.659840us per kernel call
337920 flops per kernel call, 1320 flops per site
1344 bytes per site
GFLOPS = 31.700288
GBYTES = 32.276656
Effective halo bi-directional bandwidth (GB/s) GPU = 0.000000 ( CPU = 0.000000, min = 0.000000 , max = 0.000000 ) for aggregate message size 0 bytes
[       OK ] DslashTest.benchmark (128 ms)
[ RUN      ] DslashTest.verify
Sending gauge field to GPU
Creating cudaSpinor with nParity = 1
Creating cudaSpinorOut with nParity = 1
Sending spinor field to GPU
Source: CPU = 2.032196e+03, CUDA = 2.032196e+03
running the following test:
prec    recon   dtest_type   matpc_type   dagger   S_dim      T_dimension   Ls_dimension   dslash_type   niter
single  18      Dslash       even_even    0        4/ 4/ 4    8             16             wilson        100
Grid partition info:   X  Y  Z  T
                       0  0  0  0
Calculating reference implementation...done.
Tuning...
Executing 2 kernel loops...
done.

13.312000us per kernel call
337920 flops per kernel call, 1320 flops per site
1344 bytes per site
GFLOPS = 25.384615
GBYTES = 25.846153
Effective halo bi-directional bandwidth (GB/s) GPU = 0.000000 ( CPU = 0.000000, min = 0.000000 , max = 0.000000 ) for aggregate message size 0 bytes
Results: reference = 49469.949399, QUDA = 0.000000, L2 relative deviation = 1.000000e+00, max deviation = 9.964920e+00
0 fails = 255
1 fails = 254
2 fails = 256
3 fails = 256
4 fails = 256
5 fails = 255
6 fails = 256
7 fails = 256
8 fails = 254
9 fails = 255
10 fails = 256
11 fails = 255
12 fails = 255
13 fails = 256
14 fails = 253
15 fails = 256
16 fails = 255
17 fails = 255
18 fails = 254
19 fails = 254
20 fails = 255
21 fails = 256
22 fails = 255
23 fails = 254
1.000000e-01 Failures: 4494 / 6144 = 7.314453e-01
1.000000e-02 Failures: 5961 / 6144 = 9.702148e-01
1.000000e-03 Failures: 6122 / 6144 = 9.964193e-01
1.000000e-04 Failures: 6140 / 6144 = 9.993490e-01
1.000000e-05 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-06 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-07 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-08 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-09 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-10 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-11 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-12 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-13 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-14 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-15 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-16 Failures: 6144 / 6144 = 1.000000e+00
/leonardo/home/userexternal/sbacchio/src/quda_new/tests/dslash_test.cpp:78: Failure
Expected: (deviation) <= (tol), actual: 1 vs 0.0001
CPU and CUDA implementations do not agree
[  FAILED  ] DslashTest.verify (14 ms)
        initQuda Total time = 2.729 secs
            init = 2.729 secs (100.000%), with 2 calls at 1.364e+06 us per call
            total accounted = 2.729 secs (100.000%)
            total missing   = 0.000 secs (  0.000%)
   loadGaugeQuda Total time = 0.048 secs
        download = 0.045 secs ( 94.961%), with 2 calls at 2.264e+04 us per call
            init = 0.000 secs (  0.369%), with 10 calls at 1.760e+01 us per call
         compute = 0.000 secs (  0.002%), with 2 calls at 5.000e-01 us per call
            free = 0.000 secs (  0.006%), with 73 calls at 4.110e-02 us per call
            total accounted = 0.045 secs ( 95.339%)
            total missing   = 0.002 secs (  4.661%)
         endQuda Total time = 0.002 secs
            free = 0.000 secs (  0.120%), with 63 calls at 3.175e-02 us per call
            total accounted = 0.000 secs (  0.120%)
            total missing   = 0.002 secs ( 99.880%)
initQuda-endQuda Total time = 2.874 secs
            QUDA Total time = 2.778 secs
        download = 0.045 secs (  1.630%), with 2 calls at 2.264e+04 us per call
            init = 2.729 secs ( 98.230%), with 12 calls at 2.274e+05 us per call
         compute = 0.000 secs (  0.000%), with 2 calls at 5.000e-01 us per call
            free = 0.000 secs (  0.000%), with 136 calls at 4.412e-02 us per call
            total accounted = 2.774 secs ( 99.860%)
            total missing   = 0.004 secs (  0.140%)
Device memory used = 0.4 MiB
Pinned device memory used = 0.0 MiB
Managed memory used = 0.0 MiB
Shmem memory used = 0.0 MiB
Page-locked host memory used = 0.3 MiB
Total host memory used >= 1.1 MiB
[----------] 2 tests from DslashTest (143 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (2875 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DslashTest.verify

 1 FAILED TEST
```
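For reference, the failing assertion at `dslash_test.cpp:78` compares the CPU reference against the QUDA result and requires the L2 relative deviation to stay below the tolerance (1e-4 here). A minimal sketch of such a check, with illustrative helper names rather than the exact ones used by the test:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative: L2 relative deviation between a reference field and the field under test.
double l2_relative_deviation(const std::vector<double> &ref, const std::vector<double> &out)
{
  double diff2 = 0.0, ref2 = 0.0;
  for (size_t i = 0; i < ref.size(); i++) {
    const double d = ref[i] - out[i];
    diff2 += d * d;
    ref2 += ref[i] * ref[i];
  }
  return std::sqrt(diff2 / ref2);
}

int main()
{
  // Toy data only; in the real test the fields come from the CPU reference dslash and the QUDA dslash.
  std::vector<double> ref = {1.0, 2.0, 3.0};
  std::vector<double> out = {0.0, 0.0, 0.0};

  const double tol = 1e-4;
  const double deviation = l2_relative_deviation(ref, out);
  std::printf("L2 relative deviation = %e (tol = %e) -> %s\n", deviation, tol,
              deviation <= tol ? "PASS" : "FAIL");
  return 0;
}
```

Since the log reports QUDA = 0.000000, the GPU result appears to come back all zeros, which yields a relative deviation of exactly 1, matching the reported value.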