[Review] Multi-column merge and switch between hash and sort-based approaches #137

nsakharnykh · 2018-06-26T21:09:18Z

This pull request adds the following:

a new function merge that calls the multi-column libgdf code
a new test test_dataframe_multi_column_join that checks the merge function
a new type argument in join that allows to switch between the hash and sort-based implementations

…d values are 'sort' or 'hash'. Moved the original join code to to the _sortjoin() function that is called when type=='sort'. Created an stubbed _hashjoin() function for when type='hash'. Changes should not break existing functionality, but pytest tests are currently failing for unknown reasons.

…ns. Removed the 'left_on' and 'right_on' parameters. _hashjoin is still stubbed.

…column join. Currently the call to apply_join fails in the test_joining because it expects the inputs to be a List of NumericalColumns, whereas a List of Series are being passed.

… the join kernels results with cudautils.copy_array() for the left_indices and right_indices is required because currently apply_join frees the memory of its result. This seems inefficient. We'd like joined_indices to be able to reference the original apply_join result's device memory and have it freed when joined_indices goes out of scope.

kkraus14 · 2018-06-26T23:02:58Z

Depends on rapidsai/libgdf#52

kkraus14 · 2018-06-27T04:52:30Z

pygdf/numerical.py

+        if other is None:
+            return self._stub_merge(left_on=left_on, right_on=right_on,
+                                    how=how, return_indexers=return_indexers)
+


Should this chunk be removed? If other isn't defined or is defined as None we should fall to the next block that raises a TypeError that other is not compatible.

I think it should be removed. other can never be None for this method. Also, _stub_merge can be removed, too.

sklam · 2018-06-27T18:23:50Z

pygdf/dataframe.py

@@ -594,6 +594,144 @@ def _n_largest_or_smallest(self, method, n, columns, keep):
                df[k] = self[k].reset_index().take(new_positions)
        return df.set_index(self.index.take(new_positions))

+    def merge(self, other, on=None, how='multi-left', lsuffix='_x', rsuffix='_y'):


how should not expose "multi-left" to be more compatible with pandas. Also, the test_multi.py::test_dataframe_multi_column_join is failing because it is passing in left.

@sklam I don't think we should strive to maintain Pandas compatibility, though I agree we should have 'left' instead of 'multi-left' but we need another parameter to specify whether to use a sort based join or a hash based join.

@kkraus14, I agree. When the parameters do overlap with pandas, we should keep them compatible. But, it is okay to add new parameters that pandas do not have.

sklam · 2018-06-27T18:26:34Z

There are some coding style issues i.e. missing whitespaces around operators, line too long, unnecessary parentheses in if. I can make the coding style changes if that's okay by @nsakharnykh .

sklam · 2018-06-28T20:36:35Z

pygdf/numerical.py

+
+    def _hashjoin(self, other, how='left', return_indexers=False):
+
+        return 


Better an exception here if it is not yet implemented; i.e. raise NotImplementedError

sklam · 2018-06-28T20:37:05Z

pygdf/tests/test_joining.py

+
+    pd.util.testing.assert_frame_equal(join_result.to_pandas().sort_values(list(pddf_joined.columns)).reset_index(drop=True), pddf_joined)
+
+    # def normalize(x):


these commented lines can be removed.

sklam · 2018-06-29T21:05:08Z

Adding notes: this PR depends on rapidsai/libgdf#52

sklam · 2018-07-02T21:49:10Z

pygdf/dataframe.py

+        #joined_values, joined_indicies = self._stub_merge(
+        #    lhs, rhs, left_on=on, right_on=on, how=how, return_joined_indicies=True
+        #    )
+        joined_values, joined_indicies = self._merge_gdf(


_merge_gdf can't handle empty dataframe. It leads to a std::bad_alloc in libgdf. I am fixing it in my branch with sklam@6f2f574, which is needed for rapidsai/dask-cudf#22.

nsakharnykh · 2018-07-07T01:45:29Z

@sklam I’m ok if you push changes directly to this PR

…anges.

# Conflicts: # pygdf/dataframe.py # pygdf/numerical.py # pygdf/tests/test_joining.py

kkraus14 · 2018-07-13T22:37:31Z

pygdf/dataframe.py

@@ -594,6 +594,144 @@ def _n_largest_or_smallest(self, method, n, columns, keep):
                df[k] = self[k].reset_index().take(new_positions)
        return df.set_index(self.index.take(new_positions))

+    def merge(self, other, on=None, how='multi-left', lsuffix='_x', rsuffix='_y'):
+        if how != 'multi-left':
+            raise ValueError('{!r} join not implemented yet'.format(how))


This should be a NotImplementedError

kkraus14 · 2018-07-13T22:38:49Z

pygdf/dataframe.py

@@ -621,7 +759,7 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
        - *other* must be a single DataFrame for now.
        - *on* is not supported yet due to lack of multi-index support.
        """
-        if how not in 'left,right,inner,outer':
+        if how not in ['left', 'right', 'inner', 'outer']:
            raise ValueError('unsupported {!r} join'.format(how))
        if on is not None:
            raise ValueError('"on" is not supported yet')


This should be a NotImplementedError

kkraus14 · 2018-07-17T17:10:58Z

pygdf/_gdf.py

@@ -136,6 +136,7 @@ def apply_sort(col_keys, col_vals, ascending=True):
    'left': libgdf.gdf_left_join_generic,
    'inner': libgdf.gdf_inner_join_generic,
    'outer': libgdf.gdf_outer_join_generic,
+    'multi-left': libgdf.gdf_multi_left_join_generic,


I think we should make 'left' default to the libgdf.gdf_multi_left_join_generic and have a 'left-compat' that maps to libgdf.gdf_left_join_generic.

kkraus14 · 2018-07-17T17:11:21Z

pygdf/_gdf.py

-    # Call libgdf
-    joiner(col_lhs.cffi_view, col_rhs.cffi_view, join_result_ptr)
+
+    if(how=='multi-left'):


With the change above this should be 'left'

kkraus14 · 2018-07-17T17:13:25Z

pygdf/numerical.py

+            raise ValueError('Unsupported join type')
+
+    def _hashjoin(self, other, how='left', return_indexers=False):
+        msg = "numerical hash join not implemented yet."


This error message doesn't seem right. Is this for joining on the index? If so lets update the error message to "Hash joins on index not implemented yet."

kkraus14 · 2018-07-17T17:14:11Z

pygdf/tests/test_joining.py

+#     right_pos1 = pd.Series(right_pos1).fillna(magic).tolist()
+#     right_pos2 = pd.Series(right_pos2).fillna(magic).tolist()
+#     assert tuple(left_pos1)==tuple(left_pos2)
+#     assert tuple(right_pos1)==tuple(right_pos2)


I think we can remove this block of comments.

kkraus14 · 2018-07-17T21:57:05Z

Looks good to me and the build failed on something completely unrelated. Restarted it and upon successful completion I'll merge. Thanks!

EDIT: Still flake8 issues that need to be resolved.

kkraus14 · 2018-07-18T14:50:22Z

Looks good to me! Merging. Thanks all!

[REVIEW] Example code documentation template

@wraps

…de (#137) This PR resolves an issue in `from_pandas` API where `nan`'s present in a pandas series are being converted to `<NA>` values in `cudf` resulting in incorrect column representation: ```python In [1]: import pandas as pd In [2]: import numpy as np In [3]: import cudf In [4]: s = pd.Series([np.nan, "a", "b"]) In [5]: s Out[5]: 0 NaN 1 a 2 b dtype: object In [6]: cudf.set_option("mode.pandas_compatible", True) In [7]: gs = cudf.from_pandas(s) In [8]: gs Out[8]: 0 <NA> 1 a 2 b dtype: object In [9]: gs = cudf.from_pandas(s, nan_as_null=False) --------------------------------------------------------------------------- ArrowInvalid Traceback (most recent call last) Cell In[9], line 1 ----> 1 gs = cudf.from_pandas(s, nan_as_null=False) File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/nvtx/nvtx.py:115, in annotate.__call__.<locals>.inner(*args, **kwargs) 112 @wraps(func) 113 def inner(*args, **kwargs): 114 libnvtx_push_range(self.attributes, self.domain.handle) --> 115 result = func(*args, **kwargs) 116 libnvtx_pop_range(self.domain.handle) 117 return result File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/dataframe.py:7874, in from_pandas(obj, nan_as_null) 7872 return DataFrame.from_pandas(obj, nan_as_null=nan_as_null) 7873 elif isinstance(obj, pd.Series): -> 7874 return Series.from_pandas(obj, nan_as_null=nan_as_null) 7875 elif isinstance(obj, pd.MultiIndex): 7876 return MultiIndex.from_pandas(obj, nan_as_null=nan_as_null) File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/nvtx/nvtx.py:115, in annotate.__call__.<locals>.inner(*args, **kwargs) 112 @wraps(func) 113 def inner(*args, **kwargs): 114 libnvtx_push_range(self.attributes, self.domain.handle) --> 115 result = func(*args, **kwargs) 116 libnvtx_pop_range(self.domain.handle) 117 return result File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/series.py:731, in Series.from_pandas(cls, s, nan_as_null) 729 with warnings.catch_warnings(): 730 warnings.simplefilter("ignore") --> 731 result = cls(s, nan_as_null=nan_as_null) 732 return result File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/nvtx/nvtx.py:115, in annotate.__call__.<locals>.inner(*args, **kwargs) 112 @wraps(func) 113 def inner(*args, **kwargs): 114 libnvtx_push_range(self.attributes, self.domain.handle) --> 115 result = func(*args, **kwargs) 116 libnvtx_pop_range(self.domain.handle) 117 return result File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/series.py:651, in Series.__init__(self, data, index, dtype, name, copy, nan_as_null) 633 if not isinstance(data, ColumnBase): 634 # Using `getattr_static` to check if 635 # `data` is on device memory and perform (...) 641 # be expensive or mark a buffer as 642 # unspillable. 643 has_cai = ( 644 type( 645 inspect.getattr_static( (...) 649 is property 650 ) --> 651 data = column.as_column( 652 data, 653 nan_as_null=nan_as_null, 654 dtype=dtype, 655 length=len(index) if index is not None else None, 656 ) 657 if copy and has_cai: 658 data = data.copy(deep=True) File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/column/column.py:2073, in as_column(arbitrary, nan_as_null, dtype, length) 2069 if cudf.get_option( 2070 "mode.pandas_compatible" 2071 ) and _is_pandas_nullable_extension_dtype(arbitrary.dtype): 2072 raise NotImplementedError("not supported") -> 2073 pyarrow_array = pa.array(arbitrary, from_pandas=nan_as_null) 2074 if arbitrary.dtype == cudf.dtype("object") and cudf.dtype( 2075 pyarrow_array.type.to_pandas_dtype() 2076 ) != cudf.dtype(arbitrary.dtype): 2077 raise MixedTypeError("Cannot create column with mixed types") File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/array.pxi:323, in pyarrow.lib.array() File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/array.pxi:83, in pyarrow.lib._ndarray_to_array() File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status() ArrowInvalid: Could not convert 'a' with type str: tried to convert to double In [10]: gs = cudf.from_pandas(s, nan_as_null=True) In [11]: gs.to_pandas() Out[11]: 0 None # In `cudf.pandas` this is problematic because we started with `np.nan` and now ending with `None`. 1 a 2 b dtype: object ``` Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Ashwin Srinath (https://github.com/shwina) - Matthew Roeschke (https://github.com/mroeschke) URL: rapidsai/cudf-private#137

yashv28 and others added 15 commits June 13, 2018 16:08

Added test conditions for hash-based inner join

2f3c6b4

Tests changed to use pd.testing.assert_frame_equal

acca408

Merged Siu's pull request rapidsai#136

e799e2c

Modifying apply_join() for how='multi-left'

d64d540

remove ncols in apply_join()

744b3a7

Quick fix for the indentation error

7875342

Changed lkey,rkey to be passed as lists

ea97398

Fixed a bug so that a Python for loop uses range()

ff8d53b

Modified NumericalColumn.join() to be used only for single-column joi…

01a23a5

…ns. Removed the 'left_on' and 'right_on' parameters. _hashjoin is still stubbed.

Created a new _merge_gdf function in dataframe.py for use as a multi-…

60e573e

…column join. Currently the call to apply_join fails in the test_joining because it expects the inputs to be a List of NumericalColumns, whereas a List of Series are being passed.

Type conversion fixes

8e919b7

Sorting hash-joined GDF for assertion

a8f0f1f

len() vs __len__() and default multi-left changes

d2d2230

kkraus14 reviewed Jun 27, 2018

View reviewed changes

sklam reviewed Jun 27, 2018

View reviewed changes

sklam reviewed Jun 28, 2018

View reviewed changes

sklam mentioned this pull request Jun 29, 2018

[WIP] Multi-column merge rapidsai/dask-cudf#22

Closed

sklam reviewed Jul 2, 2018

View reviewed changes

dantegd and others added 4 commits July 10, 2018 20:21

Removed unnecessary block of code, added parameter and other minor ch…

38b1cc1

…anges.

Added missing copyright headers.

302a2fc

Merge branch 'master' into 'fea-ext-hash-join'

037e72d

# Conflicts: # pygdf/dataframe.py # pygdf/numerical.py # pygdf/tests/test_joining.py

Fix most style issues

9d8c233

kkraus14 reviewed Jul 13, 2018

View reviewed changes

Change ValueError -> NotImplementedError

4889bad

nsakharnykh changed the title ~~Multi-column merge and switch between hash and sort-based approaches~~ [Review] Multi-column merge and switch between hash and sort-based approaches Jul 16, 2018

kkraus14 reviewed Jul 17, 2018

View reviewed changes

dantegd added 2 commits July 17, 2018 13:52

small corrections

a77b694

pep8 style corrections

7795d34

kkraus14 mentioned this pull request Jul 17, 2018

WIP: multi-column join #136

Closed

missing import files and other minor corrections

7fb2cf6

kkraus14 merged commit dff90c7 into rapidsai:master Jul 18, 2018

mike-wendt pushed a commit that referenced this pull request Oct 26, 2018

Merge pull request #137 from jrhemstad/fea-ext-example-documentation

fe65ca4

[REVIEW] Example code documentation template

This was referenced Jul 26, 2022

[BUG] to_datetime and astype('datetime[]') left-pads zeros or fills with zeros to fractions of second #11350

Closed

[BUG] to_datetime and astype('datetime[]') interprets garbage input instead of returning an error #11351

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Review] Multi-column merge and switch between hash and sort-based approaches #137

[Review] Multi-column merge and switch between hash and sort-based approaches #137

nsakharnykh commented Jun 26, 2018

kkraus14 commented Jun 26, 2018

kkraus14 Jun 27, 2018

sklam Jun 28, 2018

sklam Jun 27, 2018

kkraus14 Jun 27, 2018

sklam Jun 28, 2018

sklam commented Jun 27, 2018

sklam Jun 28, 2018 •

edited

Loading

sklam Jun 28, 2018

sklam commented Jun 29, 2018

sklam Jul 2, 2018

nsakharnykh commented Jul 7, 2018

kkraus14 Jul 13, 2018

kkraus14 Jul 13, 2018

kkraus14 Jul 17, 2018

kkraus14 Jul 17, 2018

kkraus14 Jul 17, 2018

kkraus14 Jul 17, 2018

kkraus14 commented Jul 17, 2018 •

edited

Loading

kkraus14 commented Jul 18, 2018


		def _hashjoin(self, other, how='left', return_indexers=False):

		return


		pd.util.testing.assert_frame_equal(join_result.to_pandas().sort_values(list(pddf_joined.columns)).reset_index(drop=True), pddf_joined)

		# def normalize(x):

[Review] Multi-column merge and switch between hash and sort-based approaches #137

[Review] Multi-column merge and switch between hash and sort-based approaches #137

Conversation

nsakharnykh commented Jun 26, 2018

kkraus14 commented Jun 26, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sklam commented Jun 27, 2018

sklam Jun 28, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sklam commented Jun 29, 2018

Choose a reason for hiding this comment

nsakharnykh commented Jul 7, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kkraus14 commented Jul 17, 2018 • edited Loading

kkraus14 commented Jul 18, 2018

sklam Jun 28, 2018 •

edited

Loading

kkraus14 commented Jul 17, 2018 •

edited

Loading