Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Unpin dask & distributed for development #10623

Merged
merged 10 commits into from
Apr 11, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions ci/benchmark/build.sh
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ export GBENCH_BENCHMARKS_DIR="$WORKSPACE/cpp/build/gbenchmarks/"
export LIBCUDF_KERNEL_CACHE_PATH="$HOME/.jitify-cache"

# Dask & Distributed option to install main(nightly) or `conda-forge` packages.
export INSTALL_DASK_MAIN=0
export INSTALL_DASK_MAIN=1

function remove_libcudf_kernel_cache_dir {
EXITCODE=$?
Expand Down Expand Up @@ -82,8 +82,8 @@ if [[ "${INSTALL_DASK_MAIN}" == 1 ]]; then
gpuci_logger "gpuci_mamba_retry update dask"
gpuci_mamba_retry update dask
else
gpuci_logger "gpuci_mamba_retry install conda-forge::dask==2022.03.0 conda-forge::distributed==2022.03.0 conda-forge::dask-core==2022.03.0 --force-reinstall"
gpuci_mamba_retry install conda-forge::dask==2022.03.0 conda-forge::distributed==2022.03.0 conda-forge::dask-core==2022.03.0 --force-reinstall
gpuci_logger "gpuci_mamba_retry install conda-forge::dask>=2022.03.0 conda-forge::distributed>=2022.03.0 conda-forge::dask-core>=2022.03.0 --force-reinstall"
gpuci_mamba_retry install conda-forge::dask>=2022.03.0 conda-forge::distributed>=2022.03.0 conda-forge::dask-core>=2022.03.0 --force-reinstall
Comment on lines +85 to +86
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been wondering if a less cumbersome way of achieving this would be to manually remove dask/label/dev from our list of channels here? Maybe something like

Suggested change
gpuci_logger "gpuci_mamba_retry install conda-forge::dask>=2022.03.0 conda-forge::distributed>=2022.03.0 conda-forge::dask-core>=2022.03.0 --force-reinstall"
gpuci_mamba_retry install conda-forge::dask>=2022.03.0 conda-forge::distributed>=2022.03.0 conda-forge::dask-core>=2022.03.0 --force-reinstall
gpuci_logger "gpuci_mamba_retry install dask>=2022.03.0 --force-reinstall"
gpuc_conda_retry config --remove channels dask/label/dev
gpuci_mamba_retry install dask>=2022.03.0 --force-reinstall

Not sure if removing the channel would persist in other runs though, cc @ajschmidt8 @jakirkham if this solution makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember trying something similar previously but didn't seem to work, testing it now again in CI

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agree. Recall Prem tried several things, but don't recall if this one (and if so whether we encountered any issues with that)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this doesn't work, happy to merge in as is since the current setup is working

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this actually install Dask main or just the latest release?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use nightlies from the Dask channel in some cases. Depends if the channel dask/label/dev is used

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And do nightlies really mean "once every night", or does it mean it's a new package after every merge?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every merge. They are equivalent to using main, but also include Dask's dependencies

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, thanks for confirming.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fi

# Install the master version of streamz
Expand Down
6 changes: 3 additions & 3 deletions ci/gpu/build.sh
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ export GIT_DESCRIBE_TAG=`git describe --tags`
export MINOR_VERSION=`echo $GIT_DESCRIBE_TAG | grep -o -E '([0-9]+\.[0-9]+)'`

# Dask & Distributed option to install main(nightly) or `conda-forge` packages.
export INSTALL_DASK_MAIN=0
export INSTALL_DASK_MAIN=1

# ucx-py version
export UCX_PY_VERSION='0.26.*'
Expand Down Expand Up @@ -112,8 +112,8 @@ function install_dask {
gpuci_mamba_retry update dask
conda list
else
gpuci_logger "gpuci_mamba_retry install conda-forge::dask==2022.03.0 conda-forge::distributed==2022.03.0 conda-forge::dask-core==2022.03.0 --force-reinstall"
gpuci_mamba_retry install conda-forge::dask==2022.03.0 conda-forge::distributed==2022.03.0 conda-forge::dask-core==2022.03.0 --force-reinstall
gpuci_logger "gpuci_mamba_retry install conda-forge::dask>=2022.03.0 conda-forge::distributed>=2022.03.0 conda-forge::dask-core>=2022.03.0 --force-reinstall"
gpuci_mamba_retry install conda-forge::dask>=2022.03.0 conda-forge::distributed>=2022.03.0 conda-forge::dask-core>=2022.03.0 --force-reinstall
fi
# Install the main version of streamz
gpuci_logger "Install the main version of streamz"
Expand Down
4 changes: 2 additions & 2 deletions conda/environments/cudf_dev_cuda11.5.yml
Original file line number Diff line number Diff line change
Expand Up @@ -43,8 +43,8 @@ dependencies:
- pydocstyle=6.1.1
- typing_extensions
- pre-commit
- dask==2022.03.0
- distributed==2022.03.0
- dask>=2022.03.0
- distributed>=2022.03.0
- streamz
- arrow-cpp=7.0.0
- dlpack>=0.5,<0.6.0a0
Expand Down
4 changes: 2 additions & 2 deletions conda/recipes/custreamz/meta.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -32,8 +32,8 @@ requirements:
- python
- streamz
- cudf {{ version }}
- dask==2022.03.0
- distributed==2022.03.0
- dask>=2022.03.0
- distributed>=2022.03.0
- python-confluent-kafka >=1.7.0,<1.8.0a0
- cudf_kafka {{ version }}

Expand Down
8 changes: 4 additions & 4 deletions conda/recipes/dask-cudf/meta.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -27,14 +27,14 @@ requirements:
host:
- python
- cudf {{ version }}
- dask==2022.03.0
- distributed==2022.03.0
- dask>=2022.03.0
- distributed>=2022.03.0
- cudatoolkit {{ cuda_version }}
run:
- python
- cudf {{ version }}
- dask==2022.03.0
- distributed==2022.03.0
- dask>=2022.03.0
- distributed>=2022.03.0
- {{ pin_compatible('cudatoolkit', max_pin='x', min_pin='x') }}

test: # [linux64]
Expand Down
10 changes: 6 additions & 4 deletions python/dask_cudf/dask_cudf/backends.py
Original file line number Diff line number Diff line change
Expand Up @@ -142,10 +142,12 @@ def meta_nonempty_cudf(x):
res = cudf.DataFrame(index=idx)
for col in x._data.names:
dtype = str(x._data[col].dtype)
if dtype in ("list", "struct"):
# Not possible to hash and store list & struct types
# as they can contain different levels of nesting or
# fields.
if dtype in ("list", "struct", "category"):
charlesbluca marked this conversation as resolved.
Show resolved Hide resolved
# 1. Not possible to hash and store list & struct types
# as they can contain different levels of nesting or
# fields.
# 2. Not possible to has `category` types as
# they often contain an underlying types to them.
res._data[col] = _get_non_empty_data(x._data[col])
else:
if dtype not in columns_with_dtype:
Expand Down
2 changes: 1 addition & 1 deletion python/dask_cudf/dask_cudf/sorting.py
Original file line number Diff line number Diff line change
Expand Up @@ -294,6 +294,6 @@ def sort_values(
df4 = df3.map_partitions(sort_function, **sort_kwargs)
if not isinstance(divisions, gd.DataFrame) and set_divisions:
# Can't have multi-column divisions elsewhere in dask (yet)
df4.divisions = methods.tolist(divisions)
df4.divisions = tuple(methods.tolist(divisions))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix is required due to this upstream change: dask/dask#8806


return df4
4 changes: 2 additions & 2 deletions python/dask_cudf/setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,8 +10,8 @@

install_requires = [
"cudf",
"dask==2022.03.0",
"distributed==2022.03.0",
"dask>=2022.03.0",
"distributed>=2022.03.0",
"fsspec>=0.6.0",
"numpy",
"pandas>=1.0,<1.4.0dev0",
Expand Down