Generalized null support in user defined functions #8213

brandon-b-miller · 2021-05-11T16:03:51Z

Draft

Adds DataFrame.apply similar to Pandas
Adds support for automatically including the validity of the operand columns in the computation of the result
Adds support for involving cudf.NA in user defined functions explicitly

This PR creates the following API:

import cudf

# nulludf is the decorator introduced by this PR; its import is omitted here.
@nulludf
def func_gdf(x, y):
    if x is cudf.NA:
        return y
    else:
        return x + y


gdf = cudf.DataFrame({
    'a': [1, None, 3, None],
    'b': [4, 5, None, None]
})
gdf.apply(lambda row: func_gdf(row['a'], row['b']), axis=1)

# 0       5
# 1       5
# 2    <NA>
# 3    <NA>
# dtype: int64
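
For readers who want to trace the null handling in the example above without a GPU, here is a minimal host-side sketch in C++. It is only an illustration of the semantics, not how the PR implements them: std::optional stands in for a nullable element, and the names masked, masked_add, func_sketch and apply_rows are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <optional>
#include <vector>

// Hypothetical stand-in for one nullable int64 element; std::nullopt plays the role of cudf.NA.
using masked = std::optional<std::int64_t>;

// Null-propagating addition: the result is valid only if both operands are valid.
masked masked_add(masked x, masked y)
{
  if (x && y) { return *x + *y; }
  return std::nullopt;
}

// Mirrors func_gdf: an explicit NA check on x, otherwise x + y.
masked func_sketch(masked x, masked y)
{
  if (!x) { return y; }
  return masked_add(x, y);
}

// Applies func_sketch row by row over two equally sized columns.
std::vector<masked> apply_rows(std::vector<masked> const& a, std::vector<masked> const& b)
{
  assert(a.size() == b.size());
  std::vector<masked> out;
  out.reserve(a.size());
  std::transform(a.begin(), a.end(), b.begin(), std::back_inserter(out), func_sketch);
  return out;
}
```

Feeding it the two columns from the example gives 5, 5, null, null: row 1 takes the explicit NA branch and returns y, while rows 2 and 3 end up null because a null operand reaches the result.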

ttnghia · 2021-07-13T20:35:03Z

cpp/include/cudf/transform.hpp

@@ -53,6 +53,12 @@ std::unique_ptr<column> transform(
  bool is_ptx,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

+std::unique_ptr<column> generalized_masked_op(
+  table_view data_view,


Typically we pass table_view by const reference (table_view const&), because copying it may recursively copy its child column_views, which is more expensive.

This may need to be changed to table_view const& (not just here, but in other places too).
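
To make the cost argument concrete, here is a small sketch with a toy view type that counts its own copies. toy_view and the two functions are hypothetical stand-ins, not cudf::table_view or cudf::column_view; the only assumption carried over from the comment is that a view holds child views that get copied along with it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a non-owning view that holds child views.
struct toy_view {
  static inline int copies = 0;  // counts copy constructions, for illustration only
  std::vector<toy_view> children;

  toy_view() = default;
  toy_view(toy_view const& other) : children(other.children) { ++copies; }
  toy_view& operator=(toy_view const&) = default;
};

// Pass by value: the argument and all of its descendants are copied on every call.
std::size_t first_child_size_by_value(toy_view v)
{
  assert(!v.children.empty());
  return v.children.front().children.size();
}

// Pass by const reference: nothing is copied.
std::size_t first_child_size_by_ref(toy_view const& v)
{
  assert(!v.children.empty());
  return v.children.front().children.size();
}
```

With a root that has three children, the first of which has two children of its own, one by-value call makes six copies (the root plus five descendants), while the by-reference call makes none.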

ttnghia · 2021-07-13T20:35:40Z

cpp/src/transform/jit/masked_udf_kernel.cu

+ * limitations under the License.
+ */
+
+// Include Jitify's cstddef header first


Why? The convention in cudf is to include from "near" to "far". So you include <transform/...> first, then <cudf/...>, then <cuda/...>, and finally the std headers.

I think the problem here is that when this file is compiled at runtime later, transform/jit/operation-udf.hpp gets string-replaced by an actual function definition that might use types from the std headers. So I think at least the order of those two headers is critical.

ttnghia · 2021-07-13T20:37:51Z

cpp/src/transform/transform.cpp

@@ -15,6 +15,7 @@
 */

 #include <jit_preprocessed_files/transform/jit/kernel.cu.jit.hpp>
+#include <jit_preprocessed_files/transform/jit/masked_udf_kernel.cu.jit.hpp>


I believe that jit headers should be included after cudf headers.

ttnghia · 2021-07-13T20:39:17Z

cpp/src/transform/transform.cpp

+  template_types.reserve(data_view.num_columns() + 1);
+
+  template_types.push_back(cudf::jit::get_type_name(outcol_view.type()));
+  for (auto const& col : data_view) {
+    template_types.push_back(cudf::jit::get_type_name(col.type()) + "*");
+    template_types.push_back(mskptr_type);
+    template_types.push_back(offset_type);
+  }


Wait, I see that you call push_back 3*num_cols() + 1 times, but reserve only num_cols() + 1 elements.

nice catch - this was unsafe. Fixed
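
A sketch of the corrected count, under simplifying assumptions: the column types are passed in as plain strings instead of being derived from a table_view, and the signature below is hypothetical rather than the one in the PR. Note that std::vector::push_back still grows the vector when too little is reserved, so the smaller reserve costs extra reallocations rather than failing outright.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical simplification: types are given as strings rather than read from column views.
std::vector<std::string> make_template_types(std::string const& out_type,
                                             std::vector<std::string> const& col_types,
                                             std::string const& mask_ptr_type,
                                             std::string const& offset_type)
{
  std::vector<std::string> template_types;

  // One entry for the output, plus three per input column:
  // data pointer type, mask pointer type, and offset type.
  std::size_t const expected = 3 * col_types.size() + 1;
  template_types.reserve(expected);

  template_types.push_back(out_type);
  for (auto const& type : col_types) {
    template_types.push_back(type + "*");
    template_types.push_back(mask_ptr_type);
    template_types.push_back(offset_type);
  }

  assert(template_types.size() == expected);
  return template_types;
}
```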

ttnghia · 2021-07-13T20:42:20Z

cpp/src/transform/transform.cpp

+  rmm::cuda_stream_view generic_stream;
+  cudf::jit::get_program_cache(*transform_jit_masked_udf_kernel_cu_jit)
+    .get_kernel(generic_kernel_name,
+                {},
+                {{"transform/jit/operation-udf.hpp", generic_cuda_source}},
+                {"-arch=sm_."})                                    //
+    ->configure_1d_max_occupancy(0, 0, 0, generic_stream.value())  //


Why is generic_stream used without initialization? Are you using the default stream? If so, use the default stream directly.

This should be fixed.

ttnghia · 2021-07-13T20:44:17Z

cpp/src/transform/transform.cpp

+    data_ptrs.push_back(cudf::jit::get_data_ptr(col));
+    mask_ptrs.push_back(col.null_mask());
+    offsets.push_back(col.offset());
+
+    kernel_args.push_back(&data_ptrs[col_idx]);
+    kernel_args.push_back(&mask_ptrs[col_idx]);
+    kernel_args.push_back(&offsets[col_idx]);
+  }


Can we use some form of std::transform instead? Raw loops are discouraged.

This is difficult due to the 1->3 transform going on here. I kept trying to do the same, but couldn't get anything that was cleaner.

How about using thrust::zip_iterator (host callable)? It lets you write to three outputs at the same time.

I managed to use zip_iterator to replace about half the logic here. There is one loop, though, that I did not see how to simplify; open to suggestions here.
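
For reference, the shape of this 1->3 problem can be sketched with the standard library alone. This is not the zip_iterator version from the PR: it uses three std::transform passes to fill the per-column vectors and keeps one loop to interleave their addresses. toy_column, kernel_inputs and build_kernel_inputs are hypothetical names. One detail worth keeping from any version: the argument array stores addresses of vector elements, so the three vectors should be fully populated before any address is taken, otherwise a later reallocation would leave earlier pointers dangling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical stand-in for a column view: data pointer, validity-mask pointer, and offset.
struct toy_column {
  void const* data;
  std::uint32_t const* mask;
  std::int64_t offset;
};

struct kernel_inputs {
  std::vector<void const*> data_ptrs;
  std::vector<std::uint32_t const*> mask_ptrs;
  std::vector<std::int64_t> offsets;
  std::vector<void*> args;  // interleaved addresses: &data, &mask, &offset for each column
};

// Fills `out` in place so the addresses stored in out.args keep pointing into out's own vectors.
void build_kernel_inputs(std::vector<toy_column> const& cols, kernel_inputs& out)
{
  out = kernel_inputs{};
  auto const n = cols.size();

  // Step 1: one std::transform per output vector.
  std::transform(cols.begin(), cols.end(), std::back_inserter(out.data_ptrs),
                 [](toy_column const& c) { return c.data; });
  std::transform(cols.begin(), cols.end(), std::back_inserter(out.mask_ptrs),
                 [](toy_column const& c) { return c.mask; });
  std::transform(cols.begin(), cols.end(), std::back_inserter(out.offsets),
                 [](toy_column const& c) { return c.offset; });

  // Step 2: the vectors no longer change size, so element addresses are stable.
  out.args.reserve(3 * n);
  for (std::size_t i = 0; i < n; ++i) {
    out.args.push_back(&out.data_ptrs[i]);
    out.args.push_back(&out.mask_ptrs[i]);
    out.args.push_back(&out.offsets[i]);
  }
  assert(out.args.size() == 3 * n);
}
```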

ttnghia · 2021-07-13T20:50:47Z

cpp/src/transform/transform.cpp

+                           mutable_column_view outmsk_view,
+                           rmm::mr::device_memory_resource* mr)
+{
+  std::vector<std::string> template_types = make_template_types(outcol_view, data_view);


One more thing to note: you can use auto const to declare almost everything, instead of writing out lengthy types like this. For example:

auto const template_types =...

brandon-b-miller · 2021-07-15T19:36:34Z

I think this is ready for another look cc @ttnghia

ttnghia · 2021-07-15T19:42:22Z

cpp/include/cudf/transform.hpp

@@ -53,6 +53,12 @@ std::unique_ptr<column> transform(
  bool is_ptx,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

+std::unique_ptr<column> generalized_masked_op(
+  table_view data_view,


This may need to be changed to table_view const& (not just here, but in other places too).

brandon-b-miller · 2021-07-16T15:19:20Z

rerun tests

brandon-b-miller · 2021-07-16T18:08:35Z

@gpucibot merge

This PR removes the c++ side of the original masked UDF code introduced in #8213. These kernels had some limitations and are now superseded by the numba-generated versions we moved to in #9174. As far as I can tell, cuDF python was the only thing consuming this API for the short time it has existed. However I am marking this breaking just in case.

Authors:
- https://github.com/brandon-b-miller

Approvers:
- Mark Harris (https://github.com/harrism)
- David Wendt (https://github.com/davidwendt)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9792

brandon-b-miller added 30 commits February 24, 2021 04:06

just debugging info

a7bbfb4

Merge branch 'branch-0.19' into fea-udf-nulls

f44c196

initial python MaskedType

193f8e0

a little cleanup

91ae6a3

basic bindings, header, placeholder c++ code

a855a6f

missed one cython file - bindings work and run

1b2c00c

fix bug

7584ad3

little more progress

4988b14

an attempt at NA plumbing

7a6427c

a little more plumbing and prototyping

5e6eb06

lots of progress

ea15da6

Merge branch 'branch-0.19' into fea-udf-nulls

d1119b2

trying to plumb to jitify launcher

961a9dd

Merge branch 'branch-0.19' into fea-udf-nulls

664bf79

progress on jitify template/launch

5e93094

null kernel launches with all arguments

03edceb

bit_is_set works

2b4c36f

successfully passing struct through the ptx function

3f76df5

pipeline fully runs

db88f9e

it lives

9a67670

cleanup and add notebook

b9da4bf

take the plunge and merge 0.20

ec302f3

integrate jitify2

cb85d88

minor cleanup

ad067eb

pushing forward with ND transform

237af25

variadic kernel up and running

8e11c7e

big plays

591627c

general logic for building template instantiation arguments

c07e187

cleanup

d21b858

attempting to use vector overload in jitify

6806968

Merge branch 'branch-21.08' into fea-udf-nulls

51ce28f

galipremsagar reviewed Jul 13, 2021

ttnghia requested changes Jul 13, 2021

ttnghia reviewed Jul 13, 2021

brandon-b-miller and others added 4 commits July 13, 2021 17:29

Apply suggestions from code review

993d841

Co-authored-by: Nghia Truong <[email protected]>

partially address reviews

512555b

Apply suggestions from code review

8f1add4

Co-authored-by: GALI PREM SAGAR <[email protected]>

Merge branch 'fea-udf-nulls' of github.com:brandon-b-miller/cudf into…

d683db9

… fea-udf-nulls

ttnghia reviewed Jul 13, 2021

brandon-b-miller added 2 commits July 14, 2021 11:57

updates

b061710

style

7c722dd

galipremsagar approved these changes Jul 15, 2021

ttnghia requested changes Jul 15, 2021

brandon-b-miller added 2 commits July 15, 2021 13:12

use table_view const&

7a7ee83

switch to a lambda

a20d630

ttnghia approved these changes Jul 15, 2021

brandon-b-miller and others added 2 commits July 15, 2021 21:52

Update cpp/src/transform/transform.cpp

a13e935

Co-authored-by: Nghia Truong <[email protected]>

updates

9acc7a9

brandon-b-miller added the "5 - Ready to Merge" label and removed the "3 - Ready for Review" label on Jul 16, 2021

rapids-bot bot merged commit 7ff4724 into rapidsai:branch-21.08 Jul 16, 2021

brandon-b-miller mentioned this pull request Jul 19, 2021

[FEA] Properly raise when attempting to cast NA to bool inside UDFs #8774

Open

brandon-b-miller mentioned this pull request Nov 29, 2021

Remove unused masked udf cython/c++ code #9792

Merged
