
dslash_test failing with QUDA_REORDER_LOCATION=CPU #1466

Closed
sbacchio opened this issue May 22, 2024 · 0 comments · Fixed by #1469
In the current version of the develop branch, `dslash_test` fails when `QUDA_REORDER_LOCATION=CPU` is set.

Reproducible with:

```shell
export QUDA_REORDER_LOCATION=CPU
dslash_test --xdim 4 --ydim 4 --zdim 4 --tdim 4
```
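A quick A/B run can confirm that the failure is specific to CPU reordering. A minimal sketch, assuming `dslash_test` is on `PATH` (the `cmd` list and the fallback to a plain `echo` are mine, so the snippet runs even where the binary is absent):

```python
import os
import shutil
import subprocess

# Hypothetical A/B driver: run the same test with both reorder locations.
# Only the CPU run is expected to show the DslashTest.verify failure.
cmd = ["dslash_test", "--xdim", "4", "--ydim", "4", "--zdim", "4", "--tdim", "4"]
if shutil.which(cmd[0]) is None:
    # Fall back to echoing the command so this sketch is runnable anywhere.
    cmd = ["echo"] + cmd

for loc in ("GPU", "CPU"):
    env = dict(os.environ, QUDA_REORDER_LOCATION=loc)
    print(f"=== QUDA_REORDER_LOCATION={loc} ===")
    subprocess.run(cmd, env=env, check=False)
```

The environment variable is passed per-invocation here so the two runs cannot contaminate each other; the log below is from a run with `QUDA_REORDER_LOCATION=CPU` exported in the shell.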
Output:
```
Disabling GPU-Direct RDMA access
Enabling peer-to-peer copy engine and direct load/store access
Rank order is column major (t running fastest)
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from DslashTest
QUDA 1.1.0 (git 1.1.0-f8855bbec-sm_80)
CUDA Driver version = 12010
CUDA Runtime version = 11080
Graphic driver version = 530.30.02
Found device 0: NVIDIA A100-SXM-64GB
Using device 0: NVIDIA A100-SXM-64GB
WARNING: Data reordering done on CPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Environment variable QUDA_RESOURCE_PATH is not set.
WARNING: Caching of tuned parameters will be disabled.
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
cublasCreated successfully
[ RUN      ] DslashTest.benchmark
Randomizing fields... Sending gauge field to GPU
Creating cudaSpinor with nParity = 1
Creating cudaSpinorOut with nParity = 1
Sending spinor field to GPU
Source: CPU = 2.032196e+03, CUDA = 2.032196e+03
running the following test:
prec    recon   dtest_type     matpc_type   dagger   S_dim         T_dimension   Ls_dimension dslash_type    niter
single   18       Dslash              even_even    0      4/  4/  4          8             16           wilson   100
Grid partition info:     X  Y  Z  T
                         0  0  0  0
Tuning...
Executing 100 kernel loops...
done.

10.659840us per kernel call
337920 flops per kernel call, 1320 flops per site 1344 bytes per site
GFLOPS = 31.700288
GBYTES = 32.276656
Effective halo bi-directional bandwidth (GB/s) GPU = 0.000000 ( CPU = 0.000000, min = 0.000000 , max = 0.000000 ) for aggregate message size 0 bytes
[       OK ] DslashTest.benchmark (128 ms)
[ RUN      ] DslashTest.verify
Sending gauge field to GPU
Creating cudaSpinor with nParity = 1
Creating cudaSpinorOut with nParity = 1
Sending spinor field to GPU
Source: CPU = 2.032196e+03, CUDA = 2.032196e+03
running the following test:
prec    recon   dtest_type     matpc_type   dagger   S_dim         T_dimension   Ls_dimension dslash_type    niter
single   18       Dslash              even_even    0      4/  4/  4          8             16           wilson   100
Grid partition info:     X  Y  Z  T
                         0  0  0  0
Calculating reference implementation...done.
Tuning...
Executing 2 kernel loops...
done.

13.312000us per kernel call
337920 flops per kernel call, 1320 flops per site 1344 bytes per site
GFLOPS = 25.384615
GBYTES = 25.846153
Effective halo bi-directional bandwidth (GB/s) GPU = 0.000000 ( CPU = 0.000000, min = 0.000000 , max = 0.000000 ) for aggregate message size 0 bytes
Results: reference = 49469.949399, QUDA = 0.000000, L2 relative deviation = 1.000000e+00, max deviation = 9.964920e+00
0 fails = 255
1 fails = 254
2 fails = 256
3 fails = 256
4 fails = 256
5 fails = 255
6 fails = 256
7 fails = 256
8 fails = 254
9 fails = 255
10 fails = 256
11 fails = 255
12 fails = 255
13 fails = 256
14 fails = 253
15 fails = 256
16 fails = 255
17 fails = 255
18 fails = 254
19 fails = 254
20 fails = 255
21 fails = 256
22 fails = 255
23 fails = 254
1.000000e-01 Failures: 4494 / 6144  = 7.314453e-01
1.000000e-02 Failures: 5961 / 6144  = 9.702148e-01
1.000000e-03 Failures: 6122 / 6144  = 9.964193e-01
1.000000e-04 Failures: 6140 / 6144  = 9.993490e-01
1.000000e-05 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-06 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-07 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-08 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-09 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-10 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-11 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-12 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-13 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-14 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-15 Failures: 6144 / 6144  = 1.000000e+00
1.000000e-16 Failures: 6144 / 6144  = 1.000000e+00
/leonardo/home/userexternal/sbacchio/src/quda_new/tests/dslash_test.cpp:78: Failure
Expected: (deviation) <= (tol), actual: 1 vs 0.0001
CPU and CUDA implementations do not agree
[  FAILED  ] DslashTest.verify (14 ms)

               initQuda Total time =     2.729 secs
                     init     =     2.729 secs (100.000%),	 with        2 calls at 1.364e+06 us per call
        total accounted       =     2.729 secs (100.000%)
        total missing         =     0.000 secs (  0.000%)

          loadGaugeQuda Total time =     0.048 secs
                 download     =     0.045 secs ( 94.961%),	 with        2 calls at 2.264e+04 us per call
                     init     =     0.000 secs (  0.369%),	 with       10 calls at 1.760e+01 us per call
                  compute     =     0.000 secs (  0.002%),	 with        2 calls at 5.000e-01 us per call
                     free     =     0.000 secs (  0.006%),	 with       73 calls at 4.110e-02 us per call
        total accounted       =     0.045 secs ( 95.339%)
        total missing         =     0.002 secs (  4.661%)

                endQuda Total time =     0.002 secs
                     free     =     0.000 secs (  0.120%),	 with       63 calls at 3.175e-02 us per call
        total accounted       =     0.000 secs (  0.120%)
        total missing         =     0.002 secs ( 99.880%)

       initQuda-endQuda Total time =     2.874 secs

                   QUDA Total time =     2.778 secs
                 download     =     0.045 secs (  1.630%),	 with        2 calls at 2.264e+04 us per call
                     init     =     2.729 secs ( 98.230%),	 with       12 calls at 2.274e+05 us per call
                  compute     =     0.000 secs (  0.000%),	 with        2 calls at 5.000e-01 us per call
                     free     =     0.000 secs (  0.000%),	 with      136 calls at 4.412e-02 us per call
        total accounted       =     2.774 secs ( 99.860%)
        total missing         =     0.004 secs (  0.140%)

Device memory used = 0.4 MiB
Pinned device memory used = 0.0 MiB
Managed memory used = 0.0 MiB
Shmem memory used = 0.0 MiB
Page-locked host memory used = 0.3 MiB
Total host memory used >= 1.1 MiB

[----------] 2 tests from DslashTest (143 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (2875 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DslashTest.verify

 1 FAILED TEST
```
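For reference, the verification report above (the L2 relative deviation and the per-tolerance failure counts) can be mimicked with a short standalone sketch. This is illustrative only, not QUDA's actual verification code, and the function names are mine:

```python
import math

def l2_relative_deviation(reference, result):
    """L2 norm of (reference - result), relative to the L2 norm of reference."""
    diff = math.sqrt(sum((r - q) ** 2 for r, q in zip(reference, result)))
    norm = math.sqrt(sum(r ** 2 for r in reference))
    return diff / norm

def failure_histogram(reference, result, tolerances):
    """Count, per tolerance, how many elements deviate by more than it."""
    return {tol: sum(1 for r, q in zip(reference, result) if abs(r - q) > tol)
            for tol in tolerances}

# A result that is identically zero, as in the log above, deviates from any
# nonzero reference by exactly its own norm: L2 relative deviation = 1.
ref = [1.0, -2.0, 0.5, 3.0]
out = [0.0] * len(ref)
print(l2_relative_deviation(ref, out))                   # 1.0
print(failure_histogram(ref, out, [1e-1, 1e-4, 1e-16]))  # every element fails
```

That is exactly the signature in the failing run: `QUDA = 0.000000` against a nonzero reference gives `L2 relative deviation = 1.000000e+00` and 6144/6144 failures at every tolerance, which points at the CPU reordering path producing a zeroed field rather than a small numerical discrepancy.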
@sbacchio sbacchio mentioned this issue May 22, 2024
@sbacchio sbacchio added the bug label May 22, 2024