[FEA] Support UDF Runtime compilation for incoming PTX with non-inlineable callees #8470

Closed
brandon-b-miller opened this issue Jun 9, 2021 · 5 comments · Fixed by #9174
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@brandon-b-miller
Contributor

Describe the bug
The current UDF compilation pipeline, used under Series.applymap, creates a generic PTX function that is inlined into a libcudf kernel, then compiled and launched using jitify. In an intermediate step, the libcudf parser converts the PTX string into a CUDA C++ function. The problem seems to be that this workflow relies on the PTX string containing exactly one function marked as .func, which is not always the case.

Consider the following function:

def f(x):
    return 2**x

The PTX string that we get from numba is written in terms of a main function and a second function that the main function calls:

.func  (.param .b64 func_retval0) __internal_accurate_pow(	.param .b64 __internal_accurate_pow_param_0)

As a result, the parser returns a malformed CUDA function: it picks up the wrong signature, and there are probably other issues as well. In general, it looks like the pipeline doesn't support cases where the incoming PTX string is written as several separate functions.
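To illustrate the single-function assumption, here is a minimal sketch that lists the `.func` headers found in a PTX string. It is not the libcudf parser: the regular expression, the helper names `ptx_function_names` and `has_single_function`, and the shortened sample PTX (including the mangled name `_ZN8__main__1fEx`) are all assumptions made for illustration.

```python
import re
from typing import List

# Hypothetical pattern for a PTX function header: optional linkage directive,
# `.func`, an optional parenthesized return parameter, then the function name.
FUNC_HEADER = re.compile(
    r"^\s*(?:\.visible\s+|\.weak\s+|\.extern\s+)?"
    r"\.func\s*(?:\([^)]*\)\s*)?([A-Za-z_$][\w$]*)",
    re.MULTILINE,
)


def ptx_function_names(ptx: str) -> List[str]:
    """Return the names of all `.func` headers, in order of appearance."""
    return FUNC_HEADER.findall(ptx)


def has_single_function(ptx: str) -> bool:
    """True only if the PTX contains exactly one `.func` header."""
    return len(ptx_function_names(ptx)) == 1


# Shortened, made-up PTX with the same shape as the failing case:
# a helper function plus the main UDF function.
SAMPLE_PTX = """
.func  (.param .b64 func_retval0) __internal_accurate_pow(
    .param .b64 __internal_accurate_pow_param_0
)
{
    ret;
}

.visible .func  (.param .b32 func_retval0) _ZN8__main__1fEx(
    .param .b64 _ZN8__main__1fEx_param_0,
    .param .b64 _ZN8__main__1fEx_param_1
)
{
    ret;
}
"""

print(ptx_function_names(SAMPLE_PTX))
print(has_single_function(SAMPLE_PTX))
```

A check like this could in principle be used to reject unsupported PTX early with a clearer message than the NVRTC failure, though that is only a suggestion and not something the pipeline is known to do.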

Steps/Code to reproduce bug

>>> x = cudf.Series([1,2,3])                                                                                                                                                                                                                                                 
>>> x.applymap(lambda x: 2**x)

Traceback (most recent call last):                                                                                                                                                                                                                                           
  File "<stdin>", line 1, in <module>                                                                                                                                                                                                                                        
  File "/home/nfs/brmiller/anaconda3/envs/cudf_dev/lib/python3.8/site-packages/cudf/core/series.py", line 3972, in applymap                                                                                                                                                  
    return self._copy_construct(data=self._unaryop(udf))                                                                                                                                                                                                                     
  File "/home/nfs/brmiller/anaconda3/envs/cudf_dev/lib/python3.8/site-packages/cudf/core/frame.py", line 2279, in _unaryop                                                                                                                                                   
    return self.__class__._from_table(Frame(data, self._index))                                                                                                                                                                                                              
  File "cudf/_lib/table.pyx", line 44, in cudf._lib.table.Table.__init__                                                                                                                                                                                                     
  File "/home/nfs/brmiller/anaconda3/envs/cudf_dev/lib/python3.8/site-packages/cudf/core/column_accessor.py", line 121, in __init__                                                                                                                                          
    data = dict(data)                                                                                                                                                                                                                                                        
  File "/home/nfs/brmiller/anaconda3/envs/cudf_dev/lib/python3.8/site-packages/cudf/core/frame.py", line 2277, in <genexpr>                                                                                                                                                  
    data_columns = (col.unary_operator(op) for col in self._columns)
  File "/home/nfs/brmiller/anaconda3/envs/cudf_dev/lib/python3.8/site-packages/cudf/core/column/numerical.py", line 118, in unary_operator
    return _numeric_column_unaryop(self, op=unaryop)
  File "/home/nfs/brmiller/anaconda3/envs/cudf_dev/lib/python3.8/site-packages/cudf/core/column/numerical.py", line 567, in _numeric_column_unaryop
    return libcudf.transform.transform(operand, op)
  File "cudf/_lib/transform.pyx", line 116, in cudf._lib.transform.transform
RuntimeError: Compilation failed: NVRTC_ERROR_COMPILATION
Compiler options: "-std=c++17 -D__CUDACC_RTC__ -default-device -arch=sm_70"
Compiler options: "-std=c++17 -D__CUDACC_RTC__ -default-device -arch=sm_70"
Header names:
  cassert
  cfloat
  climits
  cstddef
  cstdint
  ctime
  cuda/std/chrono
  cuda/std/climits
  cuda/std/cstddef
  cuda/std/limits
  cuda/std/type_traits
  cudf/types.hpp
  cudf/wrappers/durations.hpp
  cudf/wrappers/timestamps.hpp
  detail/__config
  detail/__pragma_pop
  detail/__pragma_push
  detail/libcxx/include/chrono
  detail/libcxx/include/climits
  detail/libcxx/include/cstddef
  detail/libcxx/include/ctime
  detail/libcxx/include/limits
  detail/libcxx/include/ratio
  detail/libcxx/include/type_traits
  detail/libcxx/include/version
  iterator
  libcxx/include/__config
  libcxx/include/__pragma_pop
  libcxx/include/__pragma_push
  libcxx/include/__undef_macros
  limits
  ratio
  transform/jit/operation-udf.hpp
  type_traits
  version

transform/jit/operation-udf.hpp(18): error: identifier "_ZN8__main__16_3clambda_3e_241Ex_param_0" is undefined

transform/jit/operation-udf.hpp(18): error: an asm operand must have scalar type

transform/jit/operation-udf.hpp(21): error: identifier "_ZN8__main__16_3clambda_3e_241Ex_param_1" is undefined

transform/jit/operation-udf.hpp(21): error: an asm operand must have scalar type

transform/jit/operation-udf.hpp(86): error: identifier "retval0_0" is undefined

transform/jit/operation-udf.hpp(86): error: an asm operand must have scalar type

transform/jit/kernel.cu(12): error: too many arguments in function call
          detected during instantiation of "void cudf::transformation::jit::kernel(cudf::size_type, TypeOut *, TypeIn *) [with TypeOut=int64_t, TypeIn=int64_t]"

7 errors detected in the compilation of "transform/jit/kernel.cu".

This is a problem because users can't always use pow in their UDFs, and possibly not a number of trig functions or anything else for which numba generates multiple PTX functions.

Expected behavior

0    2
1    4
2    8
dtype: int64

Environment overview (please complete the following information)

  • Environment location: [Bare-metal]
  • Method of cuDF install: [source]
@brandon-b-miller brandon-b-miller added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Jitify labels Jun 9, 2021
@jrhemstad
Contributor

CC @hummingtree

@hummingtree
Contributor

@brandon-b-miller This is expected. The same reason applies to why this workflow does not apply to, say, sin and cos.

To make this possible, one needs a more comprehensive PTX-to-CUDA parser than the current one. I did look at this in the past, but a parser this fancy probably needs to use an abstract syntax tree, which is more than I could do in the short term.

@brandon-b-miller
Contributor Author

@hummingtree thank you for clarifying, that makes sense. It looks like the changes to make this work are nontrivial.

Still, not being able to support pow or some trig functions seems like a bit of a hole in the kinds of user-defined functions we can support. As such, I'd like to explore some options for solving the problem over the long term. It's not immediately clear to me what kind of changes to the parser would accomplish this. Perhaps some kind of custom inlining of the child function? If we can come up with a concrete idea, I am happy to handle the implementation.

@brandon-b-miller brandon-b-miller changed the title [BUG] UDF Runtime compilation pipeline fails when incoming PTX uses non-inlineable callees [FEA] Support UDF Runtime compilation for incoming PTX with non-inlineable callees Jun 9, 2021
@brandon-b-miller brandon-b-miller added feature request New feature or request and removed bug Something isn't working labels Jun 9, 2021
@hummingtree
Contributor

So if we get two functions from numba, one of which calls the other, we first need to recognize that there are two PTX functions and convert both of them to CUDA functions. Then we need to do one of two things:

  1. In the caller, convert the PTX function call into a CUDA function call that invokes the other CUDA function we just converted.
  2. Manually inline the callee function into the caller function. I have not thought this through thoroughly, but it will be non-trivial as well.
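Both options start from knowing which PTX functions exist and which one calls which. The following is a rough sketch of that first step only, assuming brace-delimited function bodies and `call`/`call.uni` instructions of the form nvcc-style PTX typically uses. The helper names `split_functions` and `call_graph`, the regular expressions, and the sample PTX (with the made-up name `udf_main`) are hypothetical and are not taken from libcudf.

```python
import re
from typing import Dict, List

# Hypothetical patterns: a `.func` header up to its parameter list, and a
# `call` instruction with an optional return-value tuple before the callee.
HEADER = re.compile(r"\.func\s*(?:\([^)]*\)\s*)?([A-Za-z_$][\w$]*)\s*\(")
CALL = re.compile(r"\bcall(?:\.uni)?\s+(?:\([^)]*\)\s*,\s*)?([A-Za-z_$][\w$]*)")


def split_functions(ptx: str) -> Dict[str, str]:
    """Map each defined function name to the text inside its outer braces."""
    functions: Dict[str, str] = {}
    for match in HEADER.finditer(ptx):
        open_brace = ptx.find("{", match.end())
        semicolon = ptx.find(";", match.end())
        # A `;` before any `{` means a forward declaration without a body.
        if open_brace == -1 or (semicolon != -1 and semicolon < open_brace):
            continue
        depth = 0
        for i in range(open_brace, len(ptx)):
            if ptx[i] == "{":
                depth += 1
            elif ptx[i] == "}":
                depth -= 1
                if depth == 0:  # matching close of the function body
                    functions[match.group(1)] = ptx[open_brace + 1 : i]
                    break
    return functions


def call_graph(ptx: str) -> Dict[str, List[str]]:
    """Map each function name to the sorted names of the functions it calls."""
    return {
        name: sorted(set(CALL.findall(body)))
        for name, body in split_functions(ptx).items()
    }


# Made-up PTX: `udf_main` calls a helper through a nested call sequence block.
SAMPLE_PTX = """
.func (.param .b64 func_retval0) __internal_accurate_pow(
    .param .b64 __internal_accurate_pow_param_0
)
{
    ret;
}

.visible .func (.param .b32 func_retval0) udf_main(
    .param .b64 udf_main_param_0
)
{
    { // callseq 0, 0
    .param .b64 param0;
    .param .b64 retval0;
    call.uni (retval0),
    __internal_accurate_pow,
    (
    param0
    );
    } // callseq 0
    ret;
}
"""

print(call_graph(SAMPLE_PTX))
```

With a graph like this, option 1 would convert callees before their callers, while option 2 would substitute each callee body at its call site. Either way, the register and parameter handling that makes both options non-trivial is outside the scope of this sketch.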

@brandon-b-miller
Contributor Author

OK, I will try to put together a proof of concept and report back here.

rapids-bot bot pushed a commit that referenced this issue Sep 29, 2021
Replaces the C++ implementation of the masked UDF pipeline with a pure Python version that compiles and launches the entire kernel using numba. This solves a bunch of problems:

- CUDA 11.0 support is now available since the impl no longer needs `cuda::std::tuple` to work with NVRTC 11.0. 
- Support for special functions that compile to multiple function definitions, such as `pow`, `sin`, and `cos`, is now provided since all the PTX is compiled and linked inside numba (Fixes #8470)
- Allows us to support this corner case, which would require a separate C++ kernel in the previous implementation:
```python
def f(x):
    return 42
```

- Makes developing/adding features to the impl much easier

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Graham Markall (https://github.com/gmarkall)
  - Ashwin Srinath (https://github.com/shwina)

URL: #9174