Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Review] Fix memory bug for SVM with large n_rows #3420

Merged

Conversation

tfeher
Copy link
Contributor

@tfeher tfeher commented Jan 28, 2021

closes #3233

This PR introduces 64bit int arithmetic while accessing SVM kernel tile to avoid overflow.

@tfeher tfeher requested review from a team as code owners January 28, 2021 17:14
@tfeher tfeher added bug Something isn't working non-breaking Non-breaking change and removed CMake libcuml labels Jan 28, 2021
@tfeher
Copy link
Contributor Author

tfeher commented Jan 28, 2021

This PR depends on rapidsai/raft#131, I will update the raft path once that is merged.

@JohnZed JohnZed added the 0 - Blocked Cannot progress due to external reasons label Jan 29, 2021
Copy link
Contributor

@JohnZed JohnZed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a few tiny questions/suggestions. I marked this as blocked so we remember to merge into RAFT first and re-enable the test.

cpp/src/svm/kernelcache.cuh Outdated Show resolved Hide resolved
cpp/src/svm/kernelcache.cuh Show resolved Hide resolved
cpp/src/svm/smoblocksolve.cuh Show resolved Hide resolved
cpp/test/sg/svc_test.cu Outdated Show resolved Hide resolved
Copy link
Contributor Author

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JohnZed for the review, I have answered the comments. I have also updated the RAFT path.

cpp/src/svm/kernelcache.cuh Outdated Show resolved Hide resolved
cpp/src/svm/kernelcache.cuh Show resolved Hide resolved
cpp/src/svm/smoblocksolve.cuh Show resolved Hide resolved
cpp/test/sg/svc_test.cu Outdated Show resolved Hide resolved
@tfeher tfeher removed the 0 - Blocked Cannot progress due to external reasons label Jan 29, 2021
@codecov-io
Copy link

codecov-io commented Jan 30, 2021

Codecov Report

Merging #3420 (e113428) into branch-0.18 (550121b) will increase coverage by 0.24%.
The diff coverage is 85.45%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18    #3420      +/-   ##
===============================================
+ Coverage        71.48%   71.73%   +0.24%     
===============================================
  Files              207      212       +5     
  Lines            16748    17059     +311     
===============================================
+ Hits             11973    12237     +264     
- Misses            4775     4822      +47     
Impacted Files Coverage Δ
python/cuml/datasets/classification.py 80.80% <ø> (+0.19%) ⬆️
python/cuml/decomposition/incremental_pca.py 94.70% <ø> (ø)
python/cuml/svm/svc.pyx 95.16% <ø> (-0.62%) ⬇️
python/cuml/svm/svm_base.pyx 94.27% <ø> (-0.63%) ⬇️
python/cuml/dask/ensemble/base.py 20.27% <12.50%> (+0.95%) ⬆️
python/cuml/dask/cluster/kmeans.py 54.00% <33.33%> (ø)
python/cuml/dask/decomposition/base.py 39.53% <50.00%> (ø)
...ython/cuml/dask/ensemble/randomforestclassifier.py 30.00% <50.00%> (+0.51%) ⬆️
python/cuml/dask/ensemble/randomforestregressor.py 35.08% <50.00%> (+0.54%) ⬆️
python/cuml/dask/linear_model/linear_regression.py 59.09% <50.00%> (ø)
... and 48 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a22681c...e113428. Read the comment docs.

@tfeher
Copy link
Contributor Author

tfeher commented Feb 1, 2021

rerun tests

Copy link
Contributor

@JohnZed JohnZed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I just updated RAFT to the latest commit (there is a tiny addition after what you had previously)

@JohnZed JohnZed added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Feb 1, 2021
@JohnZed
Copy link
Contributor

JohnZed commented Feb 1, 2021

@gpucibot merge

@cjnolet
Copy link
Member

cjnolet commented Feb 2, 2021

Looks like a 404 error on one of the faiss packages:

13:19:24 conda.CondaMultiError: HTTP 404 NOT FOUND for url <https://conda.anaconda.org/rapidsai-nightly/linux-64/faiss-proc-1.0.0-cuda.tar.bz2>
13:19:24 Elapsed: 00:00.009541

I have not seen this in other PRs yet so I'm going to rerun tests.

@cjnolet
Copy link
Member

cjnolet commented Feb 2, 2021

rerun tests

1 similar comment
@tfeher
Copy link
Contributor Author

tfeher commented Feb 3, 2021

rerun tests

@tfeher
Copy link
Contributor Author

tfeher commented Feb 4, 2021

There is an unrelated error:

cuml/test/test_nearest_neighbors.py::test_ivfpq_pred[8-128-4000-False-4-16-8] Faiss assertion 'err == cudaSuccess' failed in void faiss::gpu::freeMemorySpace(faiss::gpu::MemorySpace, void*) at gpu/utils/MemorySpace.cpp:72; details: Failed to cudaFree pointer 0x7ff740def200 (error 700 an illegal memory access was encountered)
Fatal Python error: Aborted

I will merge branch-0.18 and have another try.

@rapids-bot rapids-bot bot merged commit b6e6450 into rapidsai:branch-0.18 Feb 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working CMake libcuml non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] SVC fitting 240MB of data causes 32GB V100 to go out of memory
4 participants