-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Review] Multi-column merge and switch between hash and sort-based approaches #137
Conversation
…d values are 'sort' or 'hash'. Moved the original join code to to the _sortjoin() function that is called when type=='sort'. Created an stubbed _hashjoin() function for when type='hash'. Changes should not break existing functionality, but pytest tests are currently failing for unknown reasons.
…ns. Removed the 'left_on' and 'right_on' parameters. _hashjoin is still stubbed.
…column join. Currently the call to apply_join fails in the test_joining because it expects the inputs to be a List of NumericalColumns, whereas a List of Series are being passed.
… the join kernels results with cudautils.copy_array() for the left_indices and right_indices is required because currently apply_join frees the memory of its result. This seems inefficient. We'd like joined_indices to be able to reference the original apply_join result's device memory and have it freed when joined_indices goes out of scope.
Depends on rapidsai/libgdf#52 |
pygdf/numerical.py
Outdated
if other is None: | ||
return self._stub_merge(left_on=left_on, right_on=right_on, | ||
how=how, return_indexers=return_indexers) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this chunk be removed? If other
isn't defined or is defined as None we should fall to the next block that raises a TypeError that other
is not compatible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be removed. other
can never be None
for this method. Also, _stub_merge
can be removed, too.
pygdf/dataframe.py
Outdated
@@ -594,6 +594,144 @@ def _n_largest_or_smallest(self, method, n, columns, keep): | |||
df[k] = self[k].reset_index().take(new_positions) | |||
return df.set_index(self.index.take(new_positions)) | |||
|
|||
def merge(self, other, on=None, how='multi-left', lsuffix='_x', rsuffix='_y'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how
should not expose "multi-left" to be more compatible with pandas. Also, the test_multi.py::test_dataframe_multi_column_join
is failing because it is passing in left
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sklam I don't think we should strive to maintain Pandas compatibility, though I agree we should have 'left' instead of 'multi-left' but we need another parameter to specify whether to use a sort based join or a hash based join.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kkraus14, I agree. When the parameters do overlap with pandas, we should keep them compatible. But, it is okay to add new parameters that pandas do not have.
There are some coding style issues i.e. missing whitespaces around operators, line too long, unnecessary parentheses in |
pygdf/numerical.py
Outdated
|
||
def _hashjoin(self, other, how='left', return_indexers=False): | ||
|
||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better an exception here if it is not yet implemented; i.e. raise NotImplementedError
pygdf/tests/test_joining.py
Outdated
|
||
pd.util.testing.assert_frame_equal(join_result.to_pandas().sort_values(list(pddf_joined.columns)).reset_index(drop=True), pddf_joined) | ||
|
||
# def normalize(x): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these commented lines can be removed.
Adding notes: this PR depends on rapidsai/libgdf#52 |
pygdf/dataframe.py
Outdated
#joined_values, joined_indicies = self._stub_merge( | ||
# lhs, rhs, left_on=on, right_on=on, how=how, return_joined_indicies=True | ||
# ) | ||
joined_values, joined_indicies = self._merge_gdf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_merge_gdf
can't handle empty dataframe. It leads to a std::bad_alloc
in libgdf. I am fixing it in my branch with sklam@6f2f574, which is needed for rapidsai/dask-cudf#22.
@sklam I’m ok if you push changes directly to this PR |
# Conflicts: # pygdf/dataframe.py # pygdf/numerical.py # pygdf/tests/test_joining.py
pygdf/dataframe.py
Outdated
@@ -594,6 +594,144 @@ def _n_largest_or_smallest(self, method, n, columns, keep): | |||
df[k] = self[k].reset_index().take(new_positions) | |||
return df.set_index(self.index.take(new_positions)) | |||
|
|||
def merge(self, other, on=None, how='multi-left', lsuffix='_x', rsuffix='_y'): | |||
if how != 'multi-left': | |||
raise ValueError('{!r} join not implemented yet'.format(how)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a NotImplementedError
pygdf/dataframe.py
Outdated
@@ -621,7 +759,7 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='', | |||
- *other* must be a single DataFrame for now. | |||
- *on* is not supported yet due to lack of multi-index support. | |||
""" | |||
if how not in 'left,right,inner,outer': | |||
if how not in ['left', 'right', 'inner', 'outer']: | |||
raise ValueError('unsupported {!r} join'.format(how)) | |||
if on is not None: | |||
raise ValueError('"on" is not supported yet') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a NotImplementedError
pygdf/_gdf.py
Outdated
@@ -136,6 +136,7 @@ def apply_sort(col_keys, col_vals, ascending=True): | |||
'left': libgdf.gdf_left_join_generic, | |||
'inner': libgdf.gdf_inner_join_generic, | |||
'outer': libgdf.gdf_outer_join_generic, | |||
'multi-left': libgdf.gdf_multi_left_join_generic, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should make 'left' default to the libgdf.gdf_multi_left_join_generic
and have a 'left-compat' that maps to libgdf.gdf_left_join_generic
.
pygdf/_gdf.py
Outdated
# Call libgdf | ||
joiner(col_lhs.cffi_view, col_rhs.cffi_view, join_result_ptr) | ||
|
||
if(how=='multi-left'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the change above this should be 'left'
pygdf/numerical.py
Outdated
raise ValueError('Unsupported join type') | ||
|
||
def _hashjoin(self, other, how='left', return_indexers=False): | ||
msg = "numerical hash join not implemented yet." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error message doesn't seem right. Is this for joining on the index? If so lets update the error message to "Hash joins on index not implemented yet."
pygdf/tests/test_joining.py
Outdated
# right_pos1 = pd.Series(right_pos1).fillna(magic).tolist() | ||
# right_pos2 = pd.Series(right_pos2).fillna(magic).tolist() | ||
# assert tuple(left_pos1)==tuple(left_pos2) | ||
# assert tuple(right_pos1)==tuple(right_pos2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove this block of comments.
Looks good to me and the build failed on something completely unrelated. Restarted it and upon successful completion I'll merge. Thanks! EDIT: Still flake8 issues that need to be resolved. |
Looks good to me! Merging. Thanks all! |
[REVIEW] Example code documentation template
…de (#137) This PR resolves an issue in `from_pandas` API where `nan`'s present in a pandas series are being converted to `<NA>` values in `cudf` resulting in incorrect column representation: ```python In [1]: import pandas as pd In [2]: import numpy as np In [3]: import cudf In [4]: s = pd.Series([np.nan, "a", "b"]) In [5]: s Out[5]: 0 NaN 1 a 2 b dtype: object In [6]: cudf.set_option("mode.pandas_compatible", True) In [7]: gs = cudf.from_pandas(s) In [8]: gs Out[8]: 0 <NA> 1 a 2 b dtype: object In [9]: gs = cudf.from_pandas(s, nan_as_null=False) --------------------------------------------------------------------------- ArrowInvalid Traceback (most recent call last) Cell In[9], line 1 ----> 1 gs = cudf.from_pandas(s, nan_as_null=False) File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/nvtx/nvtx.py:115, in annotate.__call__.<locals>.inner(*args, **kwargs) 112 @wraps(func) 113 def inner(*args, **kwargs): 114 libnvtx_push_range(self.attributes, self.domain.handle) --> 115 result = func(*args, **kwargs) 116 libnvtx_pop_range(self.domain.handle) 117 return result File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/dataframe.py:7874, in from_pandas(obj, nan_as_null) 7872 return DataFrame.from_pandas(obj, nan_as_null=nan_as_null) 7873 elif isinstance(obj, pd.Series): -> 7874 return Series.from_pandas(obj, nan_as_null=nan_as_null) 7875 elif isinstance(obj, pd.MultiIndex): 7876 return MultiIndex.from_pandas(obj, nan_as_null=nan_as_null) File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/nvtx/nvtx.py:115, in annotate.__call__.<locals>.inner(*args, **kwargs) 112 @wraps(func) 113 def inner(*args, **kwargs): 114 libnvtx_push_range(self.attributes, self.domain.handle) --> 115 result = func(*args, **kwargs) 116 libnvtx_pop_range(self.domain.handle) 117 return result File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/series.py:731, in Series.from_pandas(cls, s, nan_as_null) 729 with warnings.catch_warnings(): 730 warnings.simplefilter("ignore") --> 731 result = cls(s, nan_as_null=nan_as_null) 732 return result File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/nvtx/nvtx.py:115, in annotate.__call__.<locals>.inner(*args, **kwargs) 112 @wraps(func) 113 def inner(*args, **kwargs): 114 libnvtx_push_range(self.attributes, self.domain.handle) --> 115 result = func(*args, **kwargs) 116 libnvtx_pop_range(self.domain.handle) 117 return result File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/series.py:651, in Series.__init__(self, data, index, dtype, name, copy, nan_as_null) 633 if not isinstance(data, ColumnBase): 634 # Using `getattr_static` to check if 635 # `data` is on device memory and perform (...) 641 # be expensive or mark a buffer as 642 # unspillable. 643 has_cai = ( 644 type( 645 inspect.getattr_static( (...) 649 is property 650 ) --> 651 data = column.as_column( 652 data, 653 nan_as_null=nan_as_null, 654 dtype=dtype, 655 length=len(index) if index is not None else None, 656 ) 657 if copy and has_cai: 658 data = data.copy(deep=True) File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/column/column.py:2073, in as_column(arbitrary, nan_as_null, dtype, length) 2069 if cudf.get_option( 2070 "mode.pandas_compatible" 2071 ) and _is_pandas_nullable_extension_dtype(arbitrary.dtype): 2072 raise NotImplementedError("not supported") -> 2073 pyarrow_array = pa.array(arbitrary, from_pandas=nan_as_null) 2074 if arbitrary.dtype == cudf.dtype("object") and cudf.dtype( 2075 pyarrow_array.type.to_pandas_dtype() 2076 ) != cudf.dtype(arbitrary.dtype): 2077 raise MixedTypeError("Cannot create column with mixed types") File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/array.pxi:323, in pyarrow.lib.array() File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/array.pxi:83, in pyarrow.lib._ndarray_to_array() File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status() ArrowInvalid: Could not convert 'a' with type str: tried to convert to double In [10]: gs = cudf.from_pandas(s, nan_as_null=True) In [11]: gs.to_pandas() Out[11]: 0 None # In `cudf.pandas` this is problematic because we started with `np.nan` and now ending with `None`. 1 a 2 b dtype: object ``` Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Ashwin Srinath (https://github.com/shwina) - Matthew Roeschke (https://github.com/mroeschke) URL: rapidsai/cudf-private#137
This pull request adds the following:
merge
that calls the multi-column libgdf codetest_dataframe_multi_column_join
that checks themerge
functionjoin
that allows to switch between the hash and sort-based implementations