Sort + limit topk optimization (initial) #893
Codecov Report
@@ Coverage Diff @@
## main #893 +/- ##
==========================================
+ Coverage 75.25% 75.46% +0.21%
==========================================
Files 72 72
Lines 3786 3799 +13
Branches 675 678 +3
==========================================
+ Hits 2849 2867 +18
+ Misses 804 795 -9
- Partials 133 137 +4
Based on the perf benchmarks above and the fact that I'm not quite sure on how we can decide on this cc: @charlesbluca
Yeah, I think it makes sense to make this a Dask config option for now, that we can revisit if we're able to find a heuristic that reasonably decides an optimal
…into sort-topk-optimization
Thanks @ayushdg! A couple small comments:
@@ -3,7 +3,6 @@
import pytest

XFAIL_QUERIES = (
4,
Out of interest, do we know what in particular in this PR caused q4 to start passing?
I actually looked into this a bit. The errors from this query come from one of the rows having a non-standard character (C�TE D'IVOIRE) that Arrow cannot render.
It impacts the Dask DataFrame version, and only impacts the dask-cudf version if we try to print/repr it.
For whatever reason, dask-cudf's `sort_values` ended up invoking the `repr` method in cudf, which is a bit confusing. The `nsmallest` API doesn't cause the `repr` function to be invoked, which allows the query to pass.
../../dask_sql/physical/utils/sort.py:36: in apply_sort
return df.sort_values(
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/contextlib.py:79: in inner
return func(*args, **kwds)
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/site-packages/dask_cudf/core.py:249: in sort_values
df = sorting.sort_values(
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/contextlib.py:79: in inner
return func(*args, **kwds)
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/site-packages/dask_cudf/sorting.py:277: in sort_values
partitions = df[by].map_partitions(
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/site-packages/dask/dataframe/core.py:872: in map_partitions
return map_partitions(func, self, *args, **kwargs)
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/site-packages/dask/dataframe/core.py:6610: in map_partitions
token = tokenize(func, meta, *args, **kwargs)
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/site-packages/dask/base.py:933: in tokenize
hasher.update(str(normalize_token(kwargs)).encode())
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/site-packages/dask/utils.py:640: in __call__
return meth(arg, *args, **kwargs)
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/site-packages/dask/base.py:961: in normalize_dict
return normalize_token(sorted(d.items(), key=str))
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/contextlib.py:79: in inner
return func(*args, **kwds)
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/site-packages/cudf/core/dataframe.py:1880: in __repr__
return self._clean_renderable_dataframe(output)
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/site-packages/cudf/core/dataframe.py:1758: in _clean_renderable_dataframe
output = output.to_pandas().to_string(
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/contextlib.py:79: in inner
return func(*args, **kwds)
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/site-packages/cudf/core/dataframe.py:4813: in to_pandas
out_data[i] = self._data[col_key].to_pandas(
/datasets/adattagupta/miniconda3/envs/dask-sql-rust2/lib/python3.9/site-packages/cudf/core/column/string.py:5475: in to_pandas
pd_series = self.to_arrow().to_pandas(**kwargs)
pyarrow/array.pxi:823: in pyarrow.lib._PandasConvertible.to_pandas
???
pyarrow/array.pxi:1396: in pyarrow.lib.Array._to_pandas
???
pyarrow/array.pxi:1597: in pyarrow.lib._array_like_to_pandas
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E pyarrow.lib.ArrowException: Unknown error: Wrapping C�TE D'IVOIRE failed
Ah okay, that makes sense. It looks like what's happening here is that, as part of Dask's sorting algorithm, we pass a dataframe of quantile divisions to `map_partitions`, which Dask then attempts to tokenize using a string representation of the frame.
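To illustrate the mechanism discussed above, here is a minimal, stdlib-only sketch. It is not Dask's or cudf's code: `FragileFrame` is a hypothetical stand-in for a frame whose `__repr__` fails, and the `sorted(..., key=str)` call only mirrors the `normalize_dict` line visible in the traceback. The point is that stringifying a `(key, value)` tuple calls `repr` on the value.

```python
class FragileFrame:
    """Hypothetical stand-in for a dataframe that cannot be rendered."""

    def __init__(self):
        self.repr_calls = 0

    def __repr__(self):
        # Count the call, then fail the way an unrenderable frame might.
        self.repr_calls += 1
        raise ValueError("cannot render frame")


frame = FragileFrame()
# Keyword arguments as they might be handed to a map_partitions-style call.
kwargs = {"ascending": True, "divisions": frame}

try:
    # str() on each (key, value) tuple calls repr() on the value,
    # so sorting the items by their string form triggers __repr__.
    sorted(kwargs.items(), key=str)
    raised = False
except ValueError:
    raised = True
```

After this runs, `raised` is `True` and `frame.repr_calls` is at least 1, even though nothing printed the frame explicitly.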
Co-authored-by: Charles Blackmon-Luca <[email protected]>
Thanks @ayushdg! LGTM
This PR attempts to optimize SQL queries that have a combination of `ORDER BY` followed by `LIMIT` to use the `nsmallest`/`nlargest` API, which does a partition-wise sort + limit + combine instead of a full shuffle and can lead to significant performance improvements.

The implementation is currently limited to cases where all sort columns are either `ASCENDING NULLS LAST` (default) or `DESCENDING NULLS LAST`, and does not work when sorting on `object` columns such as strings for pandas-backed dataframes.

Todo:
Here are some benchmarks carried out on my workstation on a sample timeseries dataset with `14.4M rows X 4 columns`, sorting on an int column.
Values represent `wall time in seconds`.
This PR
Main
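
The partition-wise sort + limit + combine idea from the PR description can be sketched in plain Python. This is an illustrative sketch only, not the dask-sql implementation: the helper names `partition_topk` and `topk_smallest` are hypothetical, partitions are modeled as plain lists, and `heapq.nsmallest` stands in for the per-partition `nsmallest` call.

```python
import heapq
from typing import List, Sequence


def partition_topk(partition: Sequence[int], k: int) -> List[int]:
    """Sort + limit within a single partition: keep its k smallest values."""
    return heapq.nsmallest(k, partition)


def topk_smallest(partitions: Sequence[Sequence[int]], k: int) -> List[int]:
    """Combine the per-partition candidates and apply the limit once more."""
    candidates = [
        value
        for partition in partitions
        for value in partition_topk(partition, k)
    ]
    # At most k * len(partitions) candidates remain, so this final step is
    # cheap compared with globally sorting (shuffling) every row.
    return heapq.nsmallest(k, candidates)


# Three "partitions" of an integer column, mimicking ORDER BY col LIMIT 3.
partitions = [[9, 3, 7], [1, 8, 2], [6, 5, 4]]
result = topk_smallest(partitions, 3)
```

Each partition only has to contribute its own `k` best rows, because no row outside a partition's local top `k` can appear in the global top `k`. That is why the full shuffle can be skipped.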