
scanpy_funcs: Use multi-target regression #104

Conversation

ahendriksen

cuML will gain support for multi-target regression in version 22.12. This speeds up the code significantly since the regression no longer has to be performed inside a for loop. As described in rapidsai/cuml#4988, the speedup can be up to 50x.

This PR adds support for using multi-target regression. To prevent running out of memory when the input is too large (more than 100,000 rows), the existing iterative code path is used in that case.
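For reference, here is a minimal sketch of that dispatch, assuming cuML's LinearRegression and a 100,000-row threshold; the function name, threshold constant, and details differ from the actual scanpy_funcs implementation.

import cupy as cp
from cuml.linear_model import LinearRegression

LARGE_INPUT_ROWS = 100_000  # above this, fall back to the iterative per-gene path

def regress_out_sketch(normalized, n_counts, percent_mito, verbose=False):
    # Design matrix built from the two covariates to regress out.
    regressors = cp.stack([n_counts, percent_mito], axis=1)

    if normalized.shape[0] < LARGE_INPUT_ROWS:
        # cuML >= 22.12: fit all gene columns in a single multi-target call.
        lr = LinearRegression(fit_intercept=True, output_type="cupy")
        lr.fit(regressors, normalized)
        return normalized - lr.predict(regressors)

    # Large inputs: keep the existing per-gene loop to limit memory use.
    residuals = cp.empty_like(normalized)
    for i in range(normalized.shape[1]):
        if verbose and i % 500 == 0:
            print(f"Regressed {i} out of {normalized.shape[1]}")
        lr = LinearRegression(fit_intercept=True, output_type="cupy")
        lr.fit(regressors, normalized[:, i])
        residuals[:, i] = normalized[:, i] - lr.predict(regressors)
    return residuals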

@ahendriksen force-pushed the fea-multi-target-regression branch from 99eef26 to 3780943 on November 17, 2022 14:31
@ahendriksen
Author

For a representative use case, this PR speeds up the regression step by a factor of ~20.

import cupy as cp
from cuml.linear_model import LinearRegression
from time import perf_counter as timer
from contextlib import contextmanager

from scanpy_funcs import regress_out  # the repo's regress_out; adjust the module path to wherever scanpy_funcs lives

@contextmanager
def bench(name):
    # Time the enclosed block and print the elapsed seconds.
    start = timer()
    yield
    duration = timer() - start
    print(f"{name}: {duration:0.2f} seconds")

n_cells = 91_000
n_genes = 5_000

normalized = cp.random.normal(size=(n_cells, n_genes))
n_counts = cp.random.normal(size=n_cells)
percent_mito = cp.random.normal(size=n_cells)

with bench("current regress_out"):
    regress_out(normalized, n_counts, percent_mito, verbose=True) # current

with bench("new regress_out"):
    regress_out(normalized, n_counts, percent_mito, verbose=True) # PR

Results on a Volta V100 GPU:

Regressed 0 out of 5000
[... snip ...]
Regressed 4500 out of 5000

current regress_out: 16.75 seconds
new regress_out:      0.73 seconds

@Intron7
Contributor

Intron7 commented Nov 30, 2022

Dear Allard,
great work! This works amazingly well.
Do you think it would also work if we created a chunk_size parameter to run, say, 100 genes at a time even for larger-than-100k datasets?
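As a hypothetical illustration of that chunk_size idea, the multi-target solver could be fit on batches of genes so intermediate buffers stay small even for very large datasets. The function name, default, and details below are illustrative, not the code that ended up in the PR.

import cupy as cp
from cuml.linear_model import LinearRegression

def regress_out_chunked(normalized, n_counts, percent_mito, chunk_size=100):
    # Design matrix of the two covariates to regress out.
    regressors = cp.stack([n_counts, percent_mito], axis=1)
    residuals = cp.empty_like(normalized)
    n_genes = normalized.shape[1]
    # Fit the multi-target solver on chunk_size genes at a time.
    for start in range(0, n_genes, chunk_size):
        stop = min(start + chunk_size, n_genes)
        lr = LinearRegression(fit_intercept=True, output_type="cupy")
        lr.fit(regressors, normalized[:, start:stop])
        residuals[:, start:stop] = normalized[:, start:stop] - lr.predict(regressors)
    return residuals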

Member

@cjnolet cjnolet left a comment


@ahendriksen this repository hasn't been updated for RAPIDS 22.12 yet. Can you update the environment yaml files in conda/environment/*.yaml and the Dockerfile and verify the notebooks still run successfully?

Also, @Intron7 has a valid point: it would be great to still maintain some level of chunking even with the multiple-outputs support.

@Intron7
Contributor

Intron7 commented Nov 30, 2022

I already implemented this chunking for my repo rapids_singlecell and it works really well. I'm still figuring out the details of the syntax, because it's still pretty clunky.
I could write something similar for @ahendriksen's PR.

@cjnolet
Member

cjnolet commented Dec 19, 2022

I could write something similar for @ahendriksen's PR.

@Intron7 if you have time to do that, it would certainly be really useful to show off.

@Intron7
Contributor

Intron7 commented Dec 19, 2022

@cjnolet @ahendriksen done.

I created a PR for @ahendriksen branch with the batching update.

I have to test tomorrow whether we have to change the code for the multi-GPU notebook to set the batchsize argument there to None. The default is now a batchsize of 100, but I can change that to whatever you want.

Still need to test if this works with multi-GPU in the batched format.
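For illustration, a call with batching disabled might look like the lines below; the batchsize keyword follows the comment above, and the exact signature in scanpy_funcs may differ.

corrected = regress_out(normalized, n_counts, percent_mito)                  # default: 100 genes per batch
corrected = regress_out(normalized, n_counts, percent_mito, batchsize=None)  # disable batching: one multi-target fit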
@ahendriksen
Author

Thank you for adding the batching update @Intron7!

@ahendriksen this repository hasn't been updated for RAPIDS 22.12 yet. Can you update the environment yaml files in conda/environment/*.yaml and the Dockerfile and verify the notebooks still run successfully?

I have tested that the code / notebook in this PR runs successfully on RAPIDS 22.12. I have not tested if all other notebooks continue to run successfully.

@Intron7
Contributor

Intron7 commented Jan 10, 2023

@ahendriksen @cjnolet Perfect. I think this is a huge improvement, since this was one of the slowest parts of the analysis.

@cjnolet
Member

cjnolet commented Jan 10, 2023

I have to test tomorrow whether we have to change the code for the multi-GPU notebook to set the batchsize argument there to None. The default is now a batchsize of 100, but I can change that to whatever you want.

@Intron7 were you able to check this? Does it run successfully for you? If so, I'm okay merging this and I can update the conda yaml file in a follow-on PR.

@Intron7
Contributor

Intron7 commented Jan 10, 2023

I found one error in the UVM 1 million notebook. It's a memory error with CuPy and the cubin hash in the scale function. The standard scaler function works but is a lot slower.

---------------------------------------------------------------------------
CUDADriverError                           Traceback (most recent call last)
File <timed exec>:1

File cupy/_core/core.pyx:1112, in cupy._core.core._ndarray_base.mean()

File cupy/_core/core.pyx:1121, in cupy._core.core._ndarray_base.mean()

File cupy/_core/_routines_statistics.pyx:148, in cupy._core._routines_statistics._ndarray_mean()

File cupy/_core/_kernel.pyx:1310, in cupy._core._kernel.ufunc.__call__()

File cupy/_core/_kernel.pyx:1337, in cupy._core._kernel.ufunc._get_ufunc_kernel()

File cupy/_core/_kernel.pyx:1029, in cupy._core._kernel._get_ufunc_kernel()

File cupy/_core/_kernel.pyx:72, in cupy._core._kernel._get_simple_elementwise_kernel()

File cupy/_core/core.pyx:2232, in cupy._core.core.compile_with_cache()

File ~/conda/envs/rapids-22.12/lib/python3.9/site-packages/cupy/cuda/compiler.py:493, in _compile_module_with_cache(source, options, arch, cache_dir, extra_source, backend, enable_cooperative_groups, name_expressions, log_stream, jitify)
    489     return _compile_with_cache_hip(
    490         source, options, arch, cache_dir, extra_source, backend,
    491         name_expressions, log_stream, cache_in_memory)
    492 else:
--> 493     return _compile_with_cache_cuda(
    494         source, options, arch, cache_dir, extra_source, backend,
    495         enable_cooperative_groups, name_expressions, log_stream,
    496         cache_in_memory, jitify)

File ~/conda/envs/rapids-22.12/lib/python3.9/site-packages/cupy/cuda/compiler.py:562, in _compile_with_cache_cuda(source, options, arch, cache_dir, extra_source, backend, enable_cooperative_groups, name_expressions, log_stream, cache_in_memory, jitify)
    560             cubin_hash = _hash_hexdigest(cubin).encode('ascii')
    561             if hash == cubin_hash:
--> 562                 mod.load(cubin)
    563                 return mod
    564 else:
    565     # Enforce compiling -- the resulting kernel will be cached elsewhere,
    566     # so we do nothing

File cupy/cuda/function.pyx:264, in cupy.cuda.function.Module.load()

File cupy/cuda/function.pyx:266, in cupy.cuda.function.Module.load()

File cupy_backends/cuda/api/driver.pyx:210, in cupy_backends.cuda.api.driver.moduleLoadData()

File cupy_backends/cuda/api/driver.pyx:60, in cupy_backends.cuda.api.driver.check_status()

CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

@Intron7
Contributor

Intron7 commented Jan 10, 2023

@cjnolet The Multi_GPU notebook runs perfectly fine with the default of running 100 genes at once during regress_out.

@cjnolet
Member

cjnolet commented Jan 10, 2023

@Intron7 what hardware did you use to test the UVM notebook? I've found that an illegal memory access error commonly happens when 1) an indexing type overflows (like using a 32-bit int for nnz in sparse data) or 2) the memory is oversubscribed by too much.
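A couple of hypothetical quick checks for those two failure modes; `sparse_gpu_array` is a placeholder for the notebook's count matrix, not a name from the repo.

import cupy as cp

# 1) Would int32 indexing overflow on this matrix?
int32_max = cp.iinfo(cp.int32).max
if sparse_gpu_array.nnz > int32_max:
    print("nnz exceeds the int32 range; indexing overflow is plausible")

# 2) How close is device memory to being oversubscribed right now?
free_bytes, total_bytes = cp.cuda.runtime.memGetInfo()
print(f"free GPU memory: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")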

@Intron7
Contributor

Intron7 commented Jan 10, 2023

@cjnolet I ran it on a node with an 80 GB A100 PCIe and on one of the 80 GB A100s in the DGX; both gave the same error.

@cjnolet
Member

cjnolet commented Jan 10, 2023

Thanks for verifying that, @Intron7! I'll try to take a deeper look as well.

@Intron7
Contributor

Intron7 commented Jan 10, 2023

Thanks for verifying that, @Intron7! I'll try to take a deeper look as well.

It could also be our badly set up GPUs. So if the notebooks work on your end, I think it's on us.

@cjnolet cjnolet merged commit 5ba1227 into NVIDIA-Genomics-Research:master Jan 24, 2023