[BUG] CUDA 11.2 libcuml++ C++ test failures EDIT: Updated with 11.2 update 2 #3406

Closed
8 tasks done
dantegd opened this issue Jan 24, 2021 · 6 comments · Fixed by #3696


@dantegd
Member

dantegd commented Jan 24, 2021

Describe the bug
Over the weekend I updated my workstation to add CUDA 11.2 as one of the installed versions and ran the libcuml++ C++ tests to see what issues we might run into. All prims tests passed. Since there are relatively few failures, I'm opening this issue to track them.

Updated with CUDA 11.2 update 2 results (identical to update 1).

Steps/Code to reproduce bug
Build libcuml++ with CUDA 11.2 and run the C++ tests. On my 3080 workstation, these are the tests that fail:

Summary:

Note: the BatchedLevelAlgo failures only appear when I run multiple tests together (e.g. the whole suite), not when I run the BatchedLevelAlgo tests on their own.

Detailed failures:

  • UMAP Parametrizable fail:
[----------] 1 test from UMAPParametrizableTest
[ RUN      ] UMAPParametrizableTest.Result

umap_params : [15-2-0-0-true]
test_params : [false-false-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.509951

umap_params : [15-2-0-0-true]
test_params : [true-false-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.986557

umap_params : [15-2-0-0-true]
test_params : [false-true-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.50496

umap_params : [15-2-0-0-true]
test_params : [false-false-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.986196

umap_params : [15-2-0-0-true]
test_params : [true-true-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.987441

umap_params : [15-2-0-0-true]
test_params : [true-false-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.983834

umap_params : [15-2-0-0-true]
test_params : [false-true-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.986384

umap_params : [15-2-0-0-true]
test_params : [true-true-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.975491

umap_params : [15-10-0-0-true]
test_params : [false-false-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.981664

umap_params : [15-10-0-0-true]
test_params : [true-false-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.990134

umap_params : [15-10-0-0-true]
test_params : [false-true-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.981818

umap_params : [15-10-0-0-true]
test_params : [false-false-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.990429

umap_params : [15-10-0-0-true]
test_params : [true-true-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.99194

umap_params : [15-10-0-0-true]
test_params : [true-false-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.991115

umap_params : [15-10-0-0-true]
test_params : [false-true-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.991024

umap_params : [15-10-0-0-true]
test_params : [true-true-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.992093

umap_params : [15-21-500-42-false]
test_params : [false-false-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.621274

umap_params : [15-21-500-42-false]
test_params : [true-false-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.990169
Not equal, difference : 10.6728
../test/sg/umap_parametrizable_test.cu:270: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true

umap_params : [15-21-500-42-false]
test_params : [false-true-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.621554

umap_params : [15-21-500-42-false]
test_params : [false-false-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.990938

umap_params : [15-21-500-42-false]
test_params : [true-true-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.992397
Not equal, difference : 5.52223
../test/sg/umap_parametrizable_test.cu:270: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true

umap_params : [15-21-500-42-false]
test_params : [true-false-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.991514
Not equal, difference : 5.13622
../test/sg/umap_parametrizable_test.cu:270: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true

umap_params : [15-21-500-42-false]
test_params : [false-true-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.991589

umap_params : [15-21-500-42-false]
test_params : [true-true-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.991794
Not equal, difference : 1.25765
../test/sg/umap_parametrizable_test.cu:270: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true

umap_params : [15-25-500-42-false]
test_params : [false-false-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.981726

umap_params : [15-25-500-42-false]
test_params : [true-false-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.990131
Not equal, difference : 10.9862
../test/sg/umap_parametrizable_test.cu:270: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true

umap_params : [15-25-500-42-false]
test_params : [false-true-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.981879

umap_params : [15-25-500-42-false]
test_params : [false-false-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.990918

umap_params : [15-25-500-42-false]
test_params : [true-true-false-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.992429
Not equal, difference : 9.8503
../test/sg/umap_parametrizable_test.cu:270: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true

umap_params : [15-25-500-42-false]
test_params : [true-false-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.991559
Not equal, difference : 3.44948
../test/sg/umap_parametrizable_test.cu:270: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true

umap_params : [15-25-500-42-false]
test_params : [false-true-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.991617

umap_params : [15-25-500-42-false]
test_params : [true-true-true-2000-50-20-0.45]
min. expected trustworthiness: 0.45
trustworthiness: 0.991803
Not equal, difference : 3.33448
../test/sg/umap_parametrizable_test.cu:270: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true
[  FAILED  ] UMAPParametrizableTest.Result (56277 ms)
  • Batched Level Algo:
[ RUN      ] BatchedLevelAlgo/DtRegTestF.Test/2
../test/sg/decisiontree_batchedlevel_algo.cu:168: Failure
Expected equality of these values:
  depth
    Which is: 7
  inparams.max_depth
    Which is: 8
[  FAILED  ] BatchedLevelAlgo/DtRegTestF.Test/2, where GetParam() =  (5 ms)
[ RUN      ] BatchedLevelAlgo/DtRegTestF.Test/3
../test/sg/decisiontree_batchedlevel_algo.cu:168: Failure
Expected equality of these values:
  depth
    Which is: 7
  inparams.max_depth
    Which is: 8
[  FAILED  ] BatchedLevelAlgo/DtRegTestF.Test/3, where GetParam() =  (5 ms)
[----------] 4 tests from BatchedLevelAlgo/DtRegTestF (21 ms total)

Environment details (please complete the following information):

  • Environment location: Bare metal
  • Linux Distro/Architecture: Ubuntu 20.04 AMD64
  • GPU Model/Driver: 3080 / 460.56
  • CUDA: 11.2.67 / 11.2.142 (update 1)
  • Method of cuDF & cuML install: libcuml++ built from source

Old failures before update 1:

  • Random Forest:
[ RUN      ] BatchedLevelAlgo/DtRegTestF.Test/2
../test/sg/decisiontree_batchedlevel_algo.cu:170: Failure
Expected equality of these values:
  depth
    Which is: 4
  inparams.max_depth
    Which is: 8
[  FAILED  ] BatchedLevelAlgo/DtRegTestF.Test/2, where GetParam() =  (6 ms)
[ RUN      ] BatchedLevelAlgo/DtRegTestF.Test/3
../test/sg/decisiontree_batchedlevel_algo.cu:170: Failure
Expected equality of these values:
  depth
    Which is: 4
  inparams.max_depth
    Which is: 8
[  FAILED  ] BatchedLevelAlgo/DtRegTestF.Test/3, where GetParam() =  (5 ms)
  • (experimental) LARS tests
[ RUN      ] LarsTestFitPredict/0.fitGram
../test/sg/lars_test.cu:369: Failure
Value of: raft::devArrMatchHost(beta_exp, beta.data(), n_cols, raft::CompareApprox<math_t>(1e-5))
  Actual: false (actual=40.926303863525391 != expected=39.051303863525391 @1; actual=25.034524917602539 != expected=26.909526824951172 @3; )
Expected: true
[  FAILED  ] LarsTestFitPredict/0.fitGram, where TypeParam = float (4 ms)

[ RUN      ] LarsTestFitPredict/0.fitX
../test/sg/lars_test.cu:387: Failure
Value of: raft::devArrMatchHost(beta_exp, beta.data(), n_cols, raft::CompareApprox<math_t>(1e-5))
  Actual: false (actual=149.71788024902344 != expected=74.858940124511719 @0; actual=79.977607727050781 != expected=39.051303863525391 @1; actual=76.382537841796875 != expected=38.1912841796875 @2; actual=51.944034576416016 != expected=26.909526824951172 @3; actual=-0.094914264976978302 != expected=-0.047454498708248138 @4; )
Expected: true
[  FAILED  ] LarsTestFitPredict/0.fitX, where TypeParam = float (2 ms)

[ RUN      ] LarsTestFitPredict/1.fitGram
../test/sg/lars_test.cu:369: Failure
Value of: raft::devArrMatchHost(beta_exp, beta.data(), n_cols, raft::CompareApprox<math_t>(1e-5))
  Actual: false (actual=30285123824.28841 != expected=74.858938899999998 @0; actual=1562040623.4382577 != expected=39.051302499999998 @1; )
Expected: true
[  FAILED  ] LarsTestFitPredict/1.fitGram, where TypeParam = double (2 ms)

[ RUN      ] LarsTestFitPredict/1.fitX
../test/sg/lars_test.cu:387: Failure
Value of: raft::devArrMatchHost(beta_exp, beta.data(), n_cols, raft::CompareApprox<math_t>(1e-5))
  Actual: false (actual=30285123899.14735 != expected=74.858938899999998 @0; actual=1562040662.4895601 != expected=39.051302499999998 @1; actual=76.3825641880333 != expected=38.191282299999997 @2; actual=53.8190553135627 != expected=26.909527700000002 @3; actual=-0.094908712028415235 != expected=-0.047454500099999998 @4; )
Expected: true
[  FAILED  ] LarsTestFitPredict/1.fitX, where TypeParam = double (3 ms)
  • Quasi Newton Failure:
[ RUN      ] QuasiNewtonTest.binary_logistic_vs_sklearn
../test/sg/quasi_newton.cu:168: Failure
Value of: compApprox(obj_l1_b, fx)
  Actual: false
Expected: true
[  FAILED  ] QuasiNewtonTest.binary_logistic_vs_sklearn (41 ms)
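
The LARS and quasi-Newton failures above come from approximate floating-point comparisons against reference values. As a rough illustration of how such a check can report mismatching indices (the `@1`, `@3` markers in the log), here is a minimal host-side sketch. The names `approx_equal` and `mismatches` are hypothetical, and the tolerance rule is an assumption for illustration; it is not RAFT's `CompareApprox` or `devArrMatchHost` implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not RAFT's code): true when a and b agree to within
// eps, measured relative to the larger magnitude, with an absolute floor of
// eps for values near zero.
inline bool approx_equal(double a, double b, double eps) {
  const double diff = std::fabs(a - b);
  const double scale = std::fmax(std::fabs(a), std::fabs(b));
  return diff <= eps * std::fmax(scale, 1.0);
}

// Hypothetical helper: returns the indices at which the two arrays disagree.
std::vector<std::size_t> mismatches(const std::vector<double>& actual,
                                    const std::vector<double>& expected,
                                    double eps) {
  assert(actual.size() == expected.size());
  std::vector<std::size_t> bad;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    if (!approx_equal(actual[i], expected[i], eps)) bad.push_back(i);
  }
  return bad;
}
```

Calling `mismatches(actual, expected, 1e-5)` on the `fitGram` float values would flag exactly the entries whose difference exceeds the tolerance, which is the information the gtest output prints after `Actual: false`.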
@dantegd dantegd added bug Something isn't working tests Unit testing for project CUDA / C++ CUDA issue labels Jan 24, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@dantegd dantegd changed the title [BUG] CUDA 11.2 libcuml++ C++ test failures [BUG] CUDA 11.2 libcuml++ C++ test failures EDIT: Updated with 11.2 update 1 Feb 24, 2021
@dantegd
Member Author

dantegd commented Feb 24, 2021

Results have been updated with 11.2 update 1

@dantegd dantegd changed the title [BUG] CUDA 11.2 libcuml++ C++ test failures EDIT: Updated with 11.2 update 1 [BUG] CUDA 11.2 libcuml++ C++ test failures EDIT: Updated with 11.2 update 2 Mar 23, 2021
@cjnolet
Member

cjnolet commented Mar 24, 2021

With CUDA 11.2.2, the following gtests fail for me on a V100 running CentOS 7:

[----------] Global test environment tear-down
[==========] 791 tests from 100 test suites ran. (153618 ms total)
[  PASSED  ] 788 tests.
[  FAILED  ] 3 tests, listed below:
[  FAILED  ] UMAPParametrizableTest.Result
[  FAILED  ] BatchedLevelAlgo/DtRegTestF.Test/2, where GetParam() = 
[  FAILED  ] BatchedLevelAlgo/DtRegTestF.Test/3, where GetParam() = 
/home/cjnolet/workspace/cuml/cpp/test/sg/umap_parametrizable_test.cu:269: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true

umap_params : [15-21-500-42-false]
test_params : [false-true-false-2000-50-20-0.45]
nnz: 60000
nnz: 82900
min. expected trustworthiness: 0.45
trustworthiness: 0.746958

umap_params : [15-21-500-42-false]
test_params : [false-false-true-2000-50-20-0.45]
nnz: 60000
min. expected trustworthiness: 0.45
trustworthiness: 0.990969

umap_params : [15-21-500-42-false]
test_params : [true-true-false-2000-50-20-0.45]
nnz: 60000
nnz: 82900
min. expected trustworthiness: 0.45
trustworthiness: 0.992524
nnz: 60000
nnz: 82900
Not equal, difference : 6.92243
/home/cjnolet/workspace/cuml/cpp/test/sg/umap_parametrizable_test.cu:269: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true

umap_params : [15-21-500-42-false]
test_params : [true-false-true-2000-50-20-0.45]
nnz: 60000
min. expected trustworthiness: 0.45
trustworthiness: 0.991587
nnz: 60000
Not equal, difference : 5.98251
/home/cjnolet/workspace/cuml/cpp/test/sg/umap_parametrizable_test.cu:269: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true

umap_params : [15-21-500-42-false]
test_params : [false-true-true-2000-50-20-0.45]
nnz: 60000
nnz: 82900
min. expected trustworthiness: 0.45
trustworthiness: 0.991586

umap_params : [15-21-500-42-false]
test_params : [true-true-true-2000-50-20-0.45]
nnz: 60000
nnz: 82900
min. expected trustworthiness: 0.45
trustworthiness: 0.992444
nnz: 60000
nnz: 82900
Not equal, difference : 3.43222
/home/cjnolet/workspace/cuml/cpp/test/sg/umap_parametrizable_test.cu:269: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true

umap_params : [15-25-500-42-false]
test_params : [false-false-false-2000-50-20-0.45]
nnz: 60000
min. expected trustworthiness: 0.45
trustworthiness: 0.981863

umap_params : [15-25-500-42-false]
test_params : [true-false-false-2000-50-20-0.45]
nnz: 60000
min. expected trustworthiness: 0.45
trustworthiness: 0.990235
nnz: 60000
Not equal, difference : 5.83882
/home/cjnolet/workspace/cuml/cpp/test/sg/umap_parametrizable_test.cu:269: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true

umap_params : [15-25-500-42-false]
test_params : [false-true-false-2000-50-20-0.45]
nnz: 60000
nnz: 82900
min. expected trustworthiness: 0.45
trustworthiness: 0.981869

umap_params : [15-25-500-42-false]
test_params : [false-false-true-2000-50-20-0.45]
nnz: 60000
min. expected trustworthiness: 0.45
trustworthiness: 0.990964

umap_params : [15-25-500-42-false]
test_params : [true-true-false-2000-50-20-0.45]
nnz: 60000
nnz: 82900
min. expected trustworthiness: 0.45
trustworthiness: 0.992548
nnz: 60000
nnz: 82900
Not equal, difference : 10.7388
/home/cjnolet/workspace/cuml/cpp/test/sg/umap_parametrizable_test.cu:269: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true

umap_params : [15-25-500-42-false]
test_params : [true-false-true-2000-50-20-0.45]
nnz: 60000
min. expected trustworthiness: 0.45
trustworthiness: 0.991597
nnz: 60000
Not equal, difference : 4.10813
/home/cjnolet/workspace/cuml/cpp/test/sg/umap_parametrizable_test.cu:269: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true

umap_params : [15-25-500-42-false]
test_params : [false-true-true-2000-50-20-0.45]
nnz: 60000
nnz: 82900
min. expected trustworthiness: 0.45
trustworthiness: 0.991597

umap_params : [15-25-500-42-false]
test_params : [true-true-true-2000-50-20-0.45]
nnz: 60000
nnz: 82900
min. expected trustworthiness: 0.45
trustworthiness: 0.992507
nnz: 60000
nnz: 82900
Not equal, difference : 7.30424
/home/cjnolet/workspace/cuml/cpp/test/sg/umap_parametrizable_test.cu:269: Failure
Value of: are_equal(e1, e2, n_samples * umap_params.n_components, alloc, stream)
  Actual: false
Expected: true
[  FAILED  ] UMAPParametrizableTest.Result (21463 ms)
[----------] 1 test from UMAPParametrizableTest (21463 ms total)
[----------] 4 tests from BatchedLevelAlgo/DtRegTestF
[ RUN      ] BatchedLevelAlgo/DtRegTestF.Test/0
[       OK ] BatchedLevelAlgo/DtRegTestF.Test/0 (6 ms)
[ RUN      ] BatchedLevelAlgo/DtRegTestF.Test/1
[       OK ] BatchedLevelAlgo/DtRegTestF.Test/1 (6 ms)
[ RUN      ] BatchedLevelAlgo/DtRegTestF.Test/2
/home/cjnolet/workspace/cuml/cpp/test/sg/decisiontree_batchedlevel_algo.cu:168: Failure
Expected equality of these values:
  depth
    Which is: 7
  inparams.max_depth
    Which is: 8
[  FAILED  ] BatchedLevelAlgo/DtRegTestF.Test/2, where GetParam() =  (7 ms)
[ RUN      ] BatchedLevelAlgo/DtRegTestF.Test/3
/home/cjnolet/workspace/cuml/cpp/test/sg/decisiontree_batchedlevel_algo.cu:168: Failure
Expected equality of these values:
  depth
    Which is: 7
  inparams.max_depth
    Which is: 8
[  FAILED  ] BatchedLevelAlgo/DtRegTestF.Test/3, where GetParam() =  (6 ms)

@vinaydes
Contributor

vinaydes commented Apr 1, 2021

@venkywonka has found and fixed a bug in the BatchedLevelAlgo test. That should fix this issue.

@cjnolet cjnolet self-assigned this Apr 1, 2021
@cjnolet
Member

cjnolet commented Apr 1, 2021

I'm working to figure out the UMAP side.

@venkywonka venkywonka self-assigned this Apr 1, 2021
@venkywonka
Contributor

Working on fixing the BatchedLevelAlgo test bug; will send a PR ASAP.

rapids-bot bot pushed a commit that referenced this issue Apr 1, 2021
* This PR fixes the regressions shown by `BatchedLevelAlgo/DtClsTestF` and `BatchedLevelAlgo/DtRegTestF`, in which the quantiles parameter passed to the `grow_tree` function was uninitialized garbage memory instead of the quantiles computed for each column.
* It also replaces the old method of computing quantiles (`preprocess_quantiles`) with a new, more accurate one (`computeQuantiles`).
* It removes an unnecessary memory allocation to `tempmem` in the setup phase of the test fixture.
* This fixes the failing `BatchedLevelAlgo/DtRegTestF` tests reported in issue #3406.
* It also fixes the failing `BatchedLevelAlgo/DtClsTestF` tests in PR #3616.
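
To make the first point concrete: the tree builder expects a buffer holding split candidates (quantiles) for every column, so that buffer has to be filled before it is handed over. The sketch below shows one simple way to compute per-column quantiles on the host for column-major data. The function name `column_quantiles`, its signature, and the rank-based quantile rule are all assumptions for illustration; this is not cuML's `computeQuantiles` or `preprocess_quantiles`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch (not cuML's computeQuantiles): for column-major data of
// shape n_rows x n_cols, return n_bins quantile values per column, laid out
// as quantiles[col * n_bins + bin].
std::vector<float> column_quantiles(const std::vector<float>& data,
                                    std::size_t n_rows, std::size_t n_cols,
                                    std::size_t n_bins) {
  assert(n_rows > 0 && n_bins > 0);
  assert(data.size() == n_rows * n_cols);

  // Every entry is written below, so no element is left uninitialized.
  std::vector<float> quantiles(n_bins * n_cols);
  std::vector<float> column(n_rows);

  for (std::size_t c = 0; c < n_cols; ++c) {
    // Copy one column and sort it so ranks map directly to indices.
    std::copy(data.begin() + c * n_rows, data.begin() + (c + 1) * n_rows,
              column.begin());
    std::sort(column.begin(), column.end());

    for (std::size_t b = 0; b < n_bins; ++b) {
      // Rank of the (b + 1) / n_bins quantile, converted to a 0-based index.
      const std::size_t pos = ((b + 1) * n_rows) / n_bins;
      const std::size_t idx = pos > 0 ? pos - 1 : 0;
      quantiles[c * n_bins + b] = column[idx];
    }
  }
  return quantiles;
}
```

Passing a buffer built this way, rather than freshly allocated memory, is the kind of setup the test fixture was missing according to the description above.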

cc @teju85 @vinaydes @JohnZed @hcho3

Authors:
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Thejaswi. N. S (https://github.com/teju85)
  - John Zedlewski (https://github.com/JohnZed)

URL: #3690
@rapids-bot rapids-bot bot closed this as completed in #3696 Apr 8, 2021
rapids-bot bot pushed a commit that referenced this issue Apr 8, 2021
Closes #3406.

There are a couple of things to note in this PR:

1. There is a kernel in the UMAP gtest that computes an L1 distance between two different embeddings. It wasn't using atomics for the addition, so we may simply not have been seeing these failures until now.
2. The Python code wasn't failing because it uses random init instead of spectral init. I'm going to open an issue in RAFT to investigate why the spectral init differs from run to run with the same inputs, even when the random seed is set.
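
To illustrate the first point, here is a minimal host-side C++ analogue of accumulating an L1 difference into a shared total with an atomic add. It is a sketch under stated assumptions, not the code from this PR: the names `atomic_add` and `accumulate_l1` are hypothetical, and in a CUDA kernel the corresponding operation would be `atomicAdd` on a device accumulator.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Atomically adds value to total with a compare-exchange loop
// (std::atomic<double>::fetch_add is only available from C++20).
inline void atomic_add(std::atomic<double>& total, double value) {
  double current = total.load();
  while (!total.compare_exchange_weak(current, current + value)) {
    // On failure, current now holds the latest value; retry with it.
  }
}

// Hypothetical worker: adds sum(|e1[i] - e2[i]|) over [begin, end) to a
// shared total. Each worker sums its own range locally and publishes the
// partial result with a single atomic add, so concurrent workers on
// disjoint ranges cannot lose each other's updates.
void accumulate_l1(const std::vector<float>& e1, const std::vector<float>& e2,
                   std::size_t begin, std::size_t end,
                   std::atomic<double>& total) {
  assert(e1.size() == e2.size());
  assert(begin <= end && end <= e1.size());
  double partial = 0.0;
  for (std::size_t i = begin; i < end; ++i) {
    partial += std::fabs(static_cast<double>(e1[i]) - static_cast<double>(e2[i]));
  }
  atomic_add(total, partial);
}
```

With a plain, non-atomic `total += partial`, two workers can read the same old value and one update is silently dropped, which makes the reported difference smaller than the true one.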

Authors:
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Victor Lafargue (https://github.com/viclafargue)
  - John Zedlewski (https://github.com/JohnZed)

URL: #3696