[Review] Multi-column merge and switch between hash and sort-based approaches #137

Merged
23 commits merged into rapidsai:master on Jul 18, 2018

Conversation

nsakharnykh
Collaborator

This pull request adds the following:

  • a new merge function that calls the multi-column libgdf code
  • a new test, test_dataframe_multi_column_join, that checks the merge function
  • a new type argument in join that allows switching between the hash- and sort-based implementations (see the usage sketch below)
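
For illustration, a minimal usage sketch based on the description above; the column names, the list-valued `on`, and the exact spelling of the `type` keyword are assumptions rather than lines taken from the diff:

```python
import pygdf

left = pygdf.DataFrame({'a': [0, 1, 2], 'b': [10, 11, 12], 'x': [1., 2., 3.]})
right = pygdf.DataFrame({'a': [1, 2, 3], 'b': [11, 12, 13], 'y': [4., 5., 6.]})

# Multi-column merge on two key columns, backed by the multi-column libgdf code.
merged = left.merge(right, on=['a', 'b'])

# Index-based join, choosing the implementation with the new `type` argument.
lhs = pygdf.DataFrame({'key': [0, 1, 2], 'x': [1., 2., 3.]}).set_index('key')
rhs = pygdf.DataFrame({'key': [1, 2, 3], 'y': [4., 5., 6.]}).set_index('key')
joined_sort = lhs.join(rhs, type='sort')  # original sort-based path
joined_hash = lhs.join(rhs, type='hash')  # new hash-based path
```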

yashv28 and others added 15 commits June 13, 2018 16:08
…d values are 'sort' or 'hash'. Moved the original join code to the _sortjoin() function, which is called when type == 'sort'. Created a stubbed _hashjoin() function for when type == 'hash'. Changes should not break existing functionality, but pytest tests are currently failing for unknown reasons.
…ns. Removed the 'left_on' and 'right_on' parameters. _hashjoin is still stubbed.
…column join. Currently the call to apply_join fails in test_joining because it expects the inputs to be a list of NumericalColumns, whereas a list of Series is being passed.
… the join kernel's results with cudautils.copy_array() for the left_indices and right_indices is required because apply_join currently frees the memory of its result. This seems inefficient. We'd like joined_indices to be able to reference the original apply_join result's device memory and have it freed when joined_indices goes out of scope.
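
The defensive copy mentioned in the last commit message boils down to the pattern below; this is a minimal numba.cuda sketch of the idea, not pygdf's cudautils.copy_array itself:

```python
import numpy as np
from numba import cuda

def copy_device_array(dary):
    """Return a new device array that owns a copy of ``dary``'s contents."""
    out = cuda.device_array(shape=dary.shape, dtype=dary.dtype)
    out.copy_to_device(dary)  # device-to-device copy into the freshly allocated buffer
    return out

# A copy made this way stays valid even after the source buffer is freed,
# which is the situation described for apply_join's result above.
src = cuda.to_device(np.arange(8, dtype=np.int32))
left_indices = copy_device_array(src)
```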
@kkraus14
Collaborator

Depends on rapidsai/libgdf#52

    if other is None:
        return self._stub_merge(left_on=left_on, right_on=right_on,
                                how=how, return_indexers=return_indexers)

Collaborator

Should this chunk be removed? If other isn't defined, or is defined as None, we should fall through to the next block, which raises a TypeError saying that other is not compatible.

Contributor

I think it should be removed. other can never be None for this method. Also, _stub_merge can be removed, too.

@@ -594,6 +594,144 @@ def _n_largest_or_smallest(self, method, n, columns, keep):
        df[k] = self[k].reset_index().take(new_positions)
        return df.set_index(self.index.take(new_positions))

    def merge(self, other, on=None, how='multi-left', lsuffix='_x', rsuffix='_y'):
Contributor

how should not expose "multi-left", to be more compatible with pandas. Also, test_multi.py::test_dataframe_multi_column_join is failing because it passes in 'left'.

Collaborator

@sklam I don't think we should strive to maintain pandas compatibility, though I agree we should have 'left' instead of 'multi-left'. But we need another parameter to specify whether to use a sort-based or a hash-based join.

Contributor

@kkraus14, I agree. When the parameters do overlap with pandas, we should keep them compatible. But it is okay to add new parameters that pandas does not have.

@sklam
Contributor

sklam commented Jun 27, 2018

There are some coding style issues, e.g. missing whitespace around operators, lines that are too long, and unnecessary parentheses in if statements. I can make the coding-style changes if that's okay with @nsakharnykh.
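
For example, the kind of change being referred to (illustrative only, using a line from the diff further down):

```python
# before: missing whitespace around the operator and unnecessary parentheses
if(how=='multi-left'):
    ...

# after
if how == 'multi-left':
    ...
```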


    def _hashjoin(self, other, how='left', return_indexers=False):

        return
Contributor

@sklam sklam Jun 28, 2018

Better to raise an exception here since it is not yet implemented, i.e. raise NotImplementedError.
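
A sketch of the suggested change, reusing the stub's signature from the hunk above:

```python
def _hashjoin(self, other, how='left', return_indexers=False):
    # Fail loudly instead of silently returning None while the
    # hash-based join is still unimplemented.
    raise NotImplementedError('hash-based join not implemented yet')
```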


pd.util.testing.assert_frame_equal(join_result.to_pandas().sort_values(list(pddf_joined.columns)).reset_index(drop=True), pddf_joined)

# def normalize(x):
Contributor

these commented lines can be removed.

@sklam
Contributor

sklam commented Jun 29, 2018

Adding notes: this PR depends on rapidsai/libgdf#52

#joined_values, joined_indicies = self._stub_merge(
# lhs, rhs, left_on=on, right_on=on, how=how, return_joined_indicies=True
# )
joined_values, joined_indicies = self._merge_gdf(
Contributor

_merge_gdf can't handle an empty dataframe; it leads to a std::bad_alloc in libgdf. I am fixing it in my branch with sklam@6f2f574, which is needed for rapidsai/dask-cudf#22.

@nsakharnykh
Collaborator Author

@sklam I’m ok if you push changes directly to this PR

@@ -594,6 +594,144 @@ def _n_largest_or_smallest(self, method, n, columns, keep):
        df[k] = self[k].reset_index().take(new_positions)
        return df.set_index(self.index.take(new_positions))

    def merge(self, other, on=None, how='multi-left', lsuffix='_x', rsuffix='_y'):
        if how != 'multi-left':
            raise ValueError('{!r} join not implemented yet'.format(how))
Collaborator

This should be a NotImplementedError

@@ -621,7 +759,7 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
- *other* must be a single DataFrame for now.
- *on* is not supported yet due to lack of multi-index support.
"""
        if how not in 'left,right,inner,outer':
        if how not in ['left', 'right', 'inner', 'outer']:
            raise ValueError('unsupported {!r} join'.format(how))
        if on is not None:
            raise ValueError('"on" is not supported yet')
Collaborator

This should be a NotImplementedError

@nsakharnykh nsakharnykh changed the title Multi-column merge and switch between hash and sort-based approaches [Review] Multi-column merge and switch between hash and sort-based approaches Jul 16, 2018
pygdf/_gdf.py Outdated
@@ -136,6 +136,7 @@ def apply_sort(col_keys, col_vals, ascending=True):
'left': libgdf.gdf_left_join_generic,
'inner': libgdf.gdf_inner_join_generic,
'outer': libgdf.gdf_outer_join_generic,
'multi-left': libgdf.gdf_multi_left_join_generic,
Collaborator

I think we should make 'left' default to libgdf.gdf_multi_left_join_generic and have a 'left-compat' that maps to libgdf.gdf_left_join_generic.
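
Sketched out, the suggested mapping would look roughly like this; the 'left-compat' key is the proposal above, the dictionary name is arbitrary, and the libgdf_cffi import path is an assumption:

```python
from libgdf_cffi import libgdf  # assumed import for the libgdf bindings

_JOINERS = {
    'left': libgdf.gdf_multi_left_join_generic,   # multi-column join becomes the default
    'left-compat': libgdf.gdf_left_join_generic,  # preserves the original single-column behaviour
    'inner': libgdf.gdf_inner_join_generic,
    'outer': libgdf.gdf_outer_join_generic,
}
```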

pygdf/_gdf.py Outdated
# Call libgdf
joiner(col_lhs.cffi_view, col_rhs.cffi_view, join_result_ptr)

if(how=='multi-left'):
Collaborator

With the change above this should be 'left'

        raise ValueError('Unsupported join type')

    def _hashjoin(self, other, how='left', return_indexers=False):
        msg = "numerical hash join not implemented yet."
Collaborator

This error message doesn't seem right. Is this for joining on the index? If so, let's update the error message to "Hash joins on index not implemented yet."

# right_pos1 = pd.Series(right_pos1).fillna(magic).tolist()
# right_pos2 = pd.Series(right_pos2).fillna(magic).tolist()
# assert tuple(left_pos1)==tuple(left_pos2)
# assert tuple(right_pos1)==tuple(right_pos2)
Collaborator

I think we can remove this block of comments.

@kkraus14
Collaborator

kkraus14 commented Jul 17, 2018

Looks good to me, and the build failed on something completely unrelated. Restarted it; upon successful completion I'll merge. Thanks!

EDIT: There are still flake8 issues that need to be resolved.

@kkraus14 kkraus14 mentioned this pull request Jul 17, 2018
@kkraus14
Collaborator

Looks good to me! Merging. Thanks all!

@kkraus14 kkraus14 merged commit dff90c7 into rapidsai:master Jul 18, 2018
mike-wendt pushed a commit that referenced this pull request Oct 26, 2018
[REVIEW] Example code documentation template
raydouglass pushed a commit that referenced this pull request Nov 7, 2023
…de (#137)

This PR resolves an issue in the `from_pandas` API where `nan`s present in a pandas Series are converted to `<NA>` values in `cudf`, resulting in an incorrect column representation:
```python
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import cudf

In [4]: s = pd.Series([np.nan, "a", "b"])

In [5]: s
Out[5]: 
0    NaN
1      a
2      b
dtype: object

In [6]: cudf.set_option("mode.pandas_compatible", True)

In [7]: gs = cudf.from_pandas(s)

In [8]: gs
Out[8]: 
0    <NA>
1       a
2       b
dtype: object

In [9]: gs = cudf.from_pandas(s, nan_as_null=False)
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[9], line 1
----> 1 gs = cudf.from_pandas(s, nan_as_null=False)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/nvtx/nvtx.py:115, in annotate.__call__.<locals>.inner(*args, **kwargs)
    112 @wraps(func)
    113 def inner(*args, **kwargs):
    114     libnvtx_push_range(self.attributes, self.domain.handle)
--> 115     result = func(*args, **kwargs)
    116     libnvtx_pop_range(self.domain.handle)
    117     return result

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/dataframe.py:7874, in from_pandas(obj, nan_as_null)
   7872     return DataFrame.from_pandas(obj, nan_as_null=nan_as_null)
   7873 elif isinstance(obj, pd.Series):
-> 7874     return Series.from_pandas(obj, nan_as_null=nan_as_null)
   7875 elif isinstance(obj, pd.MultiIndex):
   7876     return MultiIndex.from_pandas(obj, nan_as_null=nan_as_null)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/nvtx/nvtx.py:115, in annotate.__call__.<locals>.inner(*args, **kwargs)
    112 @wraps(func)
    113 def inner(*args, **kwargs):
    114     libnvtx_push_range(self.attributes, self.domain.handle)
--> 115     result = func(*args, **kwargs)
    116     libnvtx_pop_range(self.domain.handle)
    117     return result

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/series.py:731, in Series.from_pandas(cls, s, nan_as_null)
    729 with warnings.catch_warnings():
    730     warnings.simplefilter("ignore")
--> 731     result = cls(s, nan_as_null=nan_as_null)
    732 return result

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/nvtx/nvtx.py:115, in annotate.__call__.<locals>.inner(*args, **kwargs)
    112 @wraps(func)
    113 def inner(*args, **kwargs):
    114     libnvtx_push_range(self.attributes, self.domain.handle)
--> 115     result = func(*args, **kwargs)
    116     libnvtx_pop_range(self.domain.handle)
    117     return result

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/series.py:651, in Series.__init__(self, data, index, dtype, name, copy, nan_as_null)
    633 if not isinstance(data, ColumnBase):
    634     # Using `getattr_static` to check if
    635     # `data` is on device memory and perform
   (...)
    641     # be expensive or mark a buffer as
    642     # unspillable.
    643     has_cai = (
    644         type(
    645             inspect.getattr_static(
   (...)
    649         is property
    650     )
--> 651     data = column.as_column(
    652         data,
    653         nan_as_null=nan_as_null,
    654         dtype=dtype,
    655         length=len(index) if index is not None else None,
    656     )
    657     if copy and has_cai:
    658         data = data.copy(deep=True)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/column/column.py:2073, in as_column(arbitrary, nan_as_null, dtype, length)
   2069 if cudf.get_option(
   2070     "mode.pandas_compatible"
   2071 ) and _is_pandas_nullable_extension_dtype(arbitrary.dtype):
   2072     raise NotImplementedError("not supported")
-> 2073 pyarrow_array = pa.array(arbitrary, from_pandas=nan_as_null)
   2074 if arbitrary.dtype == cudf.dtype("object") and cudf.dtype(
   2075     pyarrow_array.type.to_pandas_dtype()
   2076 ) != cudf.dtype(arbitrary.dtype):
   2077     raise MixedTypeError("Cannot create column with mixed types")

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/array.pxi:323, in pyarrow.lib.array()

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/array.pxi:83, in pyarrow.lib._ndarray_to_array()

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Could not convert 'a' with type str: tried to convert to double

In [10]: gs = cudf.from_pandas(s, nan_as_null=True)

In [11]: gs.to_pandas()
Out[11]: 
0    None     # In `cudf.pandas` this is problematic because we started with `np.nan` and are now ending up with `None`.
1       a
2       b
dtype: object
```

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: rapidsai/cudf-private#137