
dslash_test failing with QUDA_REORDER_LOCATION=CPU #1466

Closed
sbacchio opened this issue May 22, 2024 · 0 comments · Fixed by #1469
In the current version of the develop branch, `dslash_test` fails when `QUDA_REORDER_LOCATION=CPU` is set.

Reproducible with:

```shell
export QUDA_REORDER_LOCATION=CPU
dslash_test --xdim 4 --ydim 4 --zdim 4 --tdim 4
```
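A quick A/B run can confirm that the failure is specific to CPU reordering. A minimal sketch, assuming `dslash_test` is on `PATH` (the `cmd` list and the fallback to a plain `echo` are mine, so the snippet runs even where the binary is absent):

```python
import os
import shutil
import subprocess

# Hypothetical A/B driver: run the same test with both reorder locations.
# Only the CPU run is expected to show the DslashTest.verify failure.
cmd = ["dslash_test", "--xdim", "4", "--ydim", "4", "--zdim", "4", "--tdim", "4"]
if shutil.which(cmd[0]) is None:
    # Fall back to echoing the command so this sketch is runnable anywhere.
    cmd = ["echo"] + cmd

for loc in ("GPU", "CPU"):
    env = dict(os.environ, QUDA_REORDER_LOCATION=loc)
    print(f"=== QUDA_REORDER_LOCATION={loc} ===")
    subprocess.run(cmd, env=env, check=False)
```

The environment variable is passed per-invocation here so the two runs cannot contaminate each other; the log below is from a run with `QUDA_REORDER_LOCATION=CPU` exported in the shell.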
Output:
```
Disabling GPU-Direct RDMA access
Enabling peer-to-peer copy engine and direct load/store access
Rank order is column major (t running fastest)
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from DslashTest
QUDA 1.1.0 (git 1.1.0-f8855bbec-sm_80)
CUDA Driver version = 12010
CUDA Runtime version = 11080
Graphic driver version = 530.30.02
Found device 0: NVIDIA A100-SXM-64GB
Using device 0: NVIDIA A100-SXM-64GB
WARNING: Data reordering done on CPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Environment variable QUDA_RESOURCE_PATH is not set.
WARNING: Caching of tuned parameters will be disabled.
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
cublasCreated successfully
[ RUN      ] DslashTest.benchmark
Randomizing fields... Sending gauge field to GPU
Creating cudaSpinor with nParity = 1
Creating cudaSpinorOut with nParity = 1
Sending spinor field to GPU
Source: CPU = 2.032196e+03, CUDA = 2.032196e+03
running the following test:
prec    recon   dtest_type     matpc_type   dagger   S_dim         T_dimension   Ls_dimension dslash_type    niter
single   18       Dslash              even_even    0      4/  4/  4          8             16           wilson   100
Grid partition info:     X  Y  Z  T
                         0  0  0  0
Tuning...
Executing 100 kernel loops...
done.

10.659840us per kernel call
337920 flops per kernel call, 1320 flops per site 1344 bytes per site
GFLOPS = 31.700288
GBYTES = 32.276656
Effective halo bi-directional bandwidth (GB/s) GPU = 0.000000 ( CPU = 0.000000, min = 0.000000 , max = 0.000000 ) for aggregate message size 0 bytes
[       OK ] DslashTest.benchmark (128 ms)
[ RUN      ] DslashTest.verify
Sending gauge field to GPU
Creating cudaSpinor with nParity = 1
Creating cudaSpinorOut with nParity = 1
Sending spinor field to GPU
Source: CPU = 2.032196e+03, CUDA = 2.032196e+03
running the following test:
prec    recon   dtest_type     matpc_type   dagger   S_dim         T_dimension   Ls_dimension dslash_type    niter
single   18       Dslash              even_even    0      4/  4/  4          8             16           wilson   100
Grid partition info:     X  Y  Z  T
                         0  0  0  0
Calculating reference implementation...done.
Tuning...
Executing 2 kernel loops...
done.

13.312000us per kernel call
337920 flops per kernel call, 1320 flops per site 1344 bytes per site
GFLOPS = 25.384615
GBYTES = 25.846153
Effective halo bi-directional bandwidth (GB/s) GPU = 0.000000 ( CPU = 0.000000, min = 0.000000 , max = 0.000000 ) for aggregate message size 0 bytes
Results: reference = 49469.949399, QUDA = 0.000000, L2 relative deviation = 1.000000e+00, max deviation = 9.964920e+00
0 fails = 255
1 fails = 254
2 fails = 256
3 fails = 256
4 fails = 256
5 fails = 255
6 fails = 256
7 fails = 256
8 fails = 254
9 fails = 255
10 fails = 256
11 fails = 255
12 fails = 255
13 fails = 256
14 fails = 253
15 fails = 256
16 fails = 255
17 fails = 255
18 fails = 254
19 fails = 254
20 fails = 255
21 fails = 256
22 fails = 255
23 fails = 254
1.000000e-01 Failures: 4494 / 6144  = 7.314453e-01
1.000000e-02 Failures: 5961 / 6144  = 9.702148e-01
1.000000e-03 Failures: 6122 / 6144  = 9.964193e-01
1.000000e-04 Failures: 6140 / 6144  = 9.993490e-01
1.000000e-05 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-06 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-07 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-08 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-09 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-10 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-11 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-12 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-13 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-14 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-15 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-16 Failures: 6144 / 6144  = 1.000000e+00
/leonardo/home/userexternal/sbacchio/src/quda_new/tests/dslash_test.cpp:78: Failure
Expected: (deviation) <= (tol), actual: 1 vs 0.0001
CPU and CUDA implementations do not agree
[  FAILED  ] DslashTest.verify (14 ms)

               initQuda Total time =     2.729 secs
                     init     =     2.729 secs (100.000%),	 with        2 calls at 1.364e+06 us per call
        total accounted       =     2.729 secs (100.000%)
        total missing         =     0.000 secs (  0.000%)

          loadGaugeQuda Total time =     0.048 secs
                 download     =     0.045 secs ( 94.961%),	 with        2 calls at 2.264e+04 us per call
                     init     =     0.000 secs (  0.369%),	 with       10 calls at 1.760e+01 us per call
                  compute     =     0.000 secs (  0.002%),	 with        2 calls at 5.000e-01 us per call
                     free     =     0.000 secs (  0.006%),	 with       73 calls at 4.110e-02 us per call
        total accounted       =     0.045 secs ( 95.339%)
        total missing         =     0.002 secs (  4.661%)

                endQuda Total time =     0.002 secs
                     free     =     0.000 secs (  0.120%),	 with       63 calls at 3.175e-02 us per call
        total accounted       =     0.000 secs (  0.120%)
        total missing         =     0.002 secs ( 99.880%)

       initQuda-endQuda Total time =     2.874 secs

                   QUDA Total time =     2.778 secs
                 download     =     0.045 secs (  1.630%),	 with        2 calls at 2.264e+04 us per call
                     init     =     2.729 secs ( 98.230%),	 with       12 calls at 2.274e+05 us per call
                  compute     =     0.000 secs (  0.000%),	 with        2 calls at 5.000e-01 us per call
                     free     =     0.000 secs (  0.000%),	 with      136 calls at 4.412e-02 us per call
        total accounted       =     2.774 secs ( 99.860%)
        total missing         =     0.004 secs (  0.140%)

Device memory used = 0.4 MiB
Pinned device memory used = 0.0 MiB
Managed memory used = 0.0 MiB
Shmem memory used = 0.0 MiB
Page-locked host memory used = 0.3 MiB
Total host memory used >= 1.1 MiB

[----------] 2 tests from DslashTest (143 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (2875 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DslashTest.verify

 1 FAILED TEST
```
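For reference, the verification report above (the L2 relative deviation and the per-tolerance failure counts) can be mimicked with a short standalone sketch. This is illustrative only, not QUDA's actual verification code, and the function names are mine:

```python
import math

def l2_relative_deviation(reference, result):
    """L2 norm of (reference - result), relative to the L2 norm of reference."""
    diff = math.sqrt(sum((r - q) ** 2 for r, q in zip(reference, result)))
    norm = math.sqrt(sum(r ** 2 for r in reference))
    return diff / norm

def failure_histogram(reference, result, tolerances):
    """Count, per tolerance, how many elements deviate by more than it."""
    return {tol: sum(1 for r, q in zip(reference, result) if abs(r - q) > tol)
            for tol in tolerances}

# A result that is identically zero, as in the log above, deviates from any
# nonzero reference by exactly its own norm: L2 relative deviation = 1.
ref = [1.0, -2.0, 0.5, 3.0]
out = [0.0] * len(ref)
print(l2_relative_deviation(ref, out))                   # 1.0
print(failure_histogram(ref, out, [1e-1, 1e-4, 1e-16]))  # every element fails
```

That is exactly the signature in the failing run: `QUDA = 0.000000` against a nonzero reference gives `L2 relative deviation = 1.000000e+00` and 6144/6144 failures at every tolerance, which points at the CPU reordering path producing a zeroed field rather than a small numerical discrepancy.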
@sbacchio sbacchio mentioned this issue May 22, 2024
@sbacchio sbacchio added the bug label May 22, 2024