
[BUG] Dask cuDF groupby-aggregations slower with UCX, particularly with split_out #403

Closed
beckernick opened this issue Jan 29, 2020 · 1 comment


beckernick commented Jan 29, 2020

Perhaps related to #402 , we see slowdowns with groupby-aggregations when using UCX. This is particularly relevant when using split_out, which is often necessary for at-scale groupby-aggregations with many unique keys.
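For readers unfamiliar with split_out: by default Dask reduces a groupby-aggregation down to a single output partition, which can exhaust memory when there are many unique keys; split_out=N instead hashes each key into one of N output partitions. A minimal pure-Python sketch of that idea (the helper name and data are illustrative, not Dask internals):

```python
from collections import defaultdict

def grouped_mean_split_out(rows, split_out):
    """Hash each group key into one of `split_out` output buckets,
    then compute per-key means inside each bucket independently."""
    buckets = [defaultdict(lambda: [0.0, 0]) for _ in range(split_out)]
    for key, value in rows:
        acc = buckets[hash(key) % split_out][key]
        acc[0] += value  # running sum
        acc[1] += 1      # running count
    # Each bucket corresponds to one output partition; no single
    # structure ever holds all of the unique keys at once.
    return [{k: s / n for k, (s, n) in b.items()} for b in buckets]

rows = [("a", 1.0), ("b", 3.0), ("a", 3.0), ("b", 5.0)]
parts = grouped_mean_split_out(rows, split_out=2)
merged = {k: v for part in parts for k, v in part.items()}
# merged == {"a": 2.0, "b": 4.0}
```

The trade-off is that each of the N output partitions requires its own shuffle of data between workers, which is exactly where the communication protocol (TCP vs. UCX) matters.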

I see these rough timings on a DGX-2:

%%time

res = df.groupby(['a_0', 'a_1']).a_non_merge_2.mean()
res = res.persist()
_ = wait(res)

TCP

  • 3.75s

UCX + NVLink

  • 6.41s

%%time

res = df.groupby(['a_0', 'a_1']).a_non_merge_2.mean(split_out=16)
res = res.persist()
_ = wait(res)

TCP

  • 7.87s

UCX + NVLink

  • 19.9s
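
For anyone reproducing this outside Jupyter, where %%time is unavailable, a small context manager gives equivalent wall-clock numbers. This helper is my own sketch, not part of the repro script below:

```python
import time
from contextlib import contextmanager

@contextmanager
def wall_clock(label):
    # Measure wall-clock time around a block, like %%time in Jupyter.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f}s")

# Stand-in workload; replace with persist() + wait() on the real query.
with wall_clock("groupby mean"):
    total = sum(i * i for i in range(100_000))
```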

Code to reproduce:

import cudf
import dask.array as da
import dask.dataframe as dd
import dask_cudf

from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
from dask.utils import parse_bytes

protocol = "tcp"
enable_nvlink = False

# protocol = "ucx"
# enable_nvlink = True


def create_random_data(n_rows=1_000, n_parts=10, n_keys_index_1=100_000,
                       n_keys_index_2=100, n_keys_index_3=100, col_prefix='a'):
    chunks = n_rows // n_parts

    df = dd.concat([
        da.random.random(n_rows, chunks=chunks).to_dask_dataframe(columns=col_prefix + '_non_merge_1'),
        da.random.random(n_rows, chunks=chunks).to_dask_dataframe(columns=col_prefix + '_non_merge_2'),
        da.random.random(n_rows, chunks=chunks).to_dask_dataframe(columns=col_prefix + '_non_merge_3'),
        da.random.randint(0, n_keys_index_1, size=n_rows, chunks=chunks).to_dask_dataframe(columns=col_prefix + '_0'),
        da.random.randint(0, n_keys_index_2, size=n_rows, chunks=chunks).to_dask_dataframe(columns=col_prefix + '_1'),
        da.random.randint(0, n_keys_index_3, size=n_rows, chunks=chunks).to_dask_dataframe(columns=col_prefix + '_2'),
    ], axis=1).persist()

    # Convert the pandas-backed partitions to cuDF on the workers.
    gdf = df.map_partitions(cudf.from_pandas)
    gdf = gdf.persist()
    _ = wait(gdf)
    return gdf


def setup_rmm_pool(client):
    # Create a 26 GB RMM memory pool on each worker up front,
    # avoiding per-allocation overhead during the benchmark.
    client.run(
        cudf.set_allocator,
        pool=True,
        initial_pool_size=parse_bytes("26GB"),
        allocator="default",
    )
    return None

cluster = LocalCUDACluster(
    protocol=protocol,
    enable_nvlink=enable_nvlink
)
client = Client(cluster)

setup_rmm_pool(client)

rows, parts = 10_000_000, 250
df = create_random_data(n_rows=rows, n_parts=parts, col_prefix='a', n_keys_index_2=10000)
res = df.groupby(['a_0', 'a_1']).a_non_merge_2.mean(split_out=16)
res = res.persist()
_ = wait(res)

Relevant Conda Environment:

cudf                      0.13.0a200129          py37_748    rapidsai-nightly
dask                      2.10.0                     py_0    conda-forge
dask-core                 2.10.0                     py_0    conda-forge
dask-cuda                 0.13.0a200129            py37_2    rapidsai-nightly
dask-cudf                 0.13.0a200129          py37_748    rapidsai-nightly
dask-xgboost              0.2.0.dev28      cuda10.1py37_0    rapidsai-nightly
distributed               2.10.0                     py_0    conda-forge
libcudf                   0.13.0a200129      cuda10.1_743    rapidsai-nightly
ucx                       1.7.0dev+g9d06c3a    cuda10.1_129    rapidsai-nightly
ucx-py                    0.13.0a200123+g896e60b         py37_12    rapidsai-nightly
beckernick changed the title to "[BUG] Dask cuDF groupby-aggregations slower with UCX, particularly with split_out" on Jan 29, 2020
pentschev commented

This indeed used to be an issue but I believe it was resolved by UCX upstream. New numbers with RAPIDS 0.20 and UCX 1.9:

TCP: 1.07s
UCX: 1.36s

TCP split_out=16: 1.97s
UCX split_out=16: 1.95s

It seems that for this particular workflow there isn't much gain with UCX, but the underlying slowdown appears to be resolved. Closing this for now; please reopen if you encounter this again.
