Allow for strings for comparison in UDFs / fillna for categorical columns #111

Closed
kkraus14 opened this issue Mar 6, 2018 · 3 comments
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request Python Affects Python cuDF API.

Comments

@kkraus14
Collaborator

kkraus14 commented Mar 6, 2018

When you call .query() or .fillna() on a categorical column, you have to pass the integer category code rather than the category value itself.

Example:

import pandas as pd
from pygdf.dataframe import DataFrame

pdf = pd.DataFrame({"key": ["a", "b", "c", "d", "e", None, "f", None, "null_key"], "value": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
pdf['key'] = pdf['key'].astype('category')

gdf = DataFrame.from_pandas(pdf)

These calls work because they pass 6, the index of null_key in gdf['key'].cat.categories:

gdf['key'] = gdf['key'].fillna(6)
gdf.query('key == 6').head().to_pandas()

These calls do not work; they fail with "Failed at nopython (nopython frontend)". I assume this is because numba cannot compile the function with a string type?

gdf['key'] = gdf['key'].fillna("null_key")
gdf.query('key == "null_key"').head().to_pandas()
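
The calls that work depend on knowing the integer code behind a label. As a minimal sketch of that lookup, the snippet below uses plain pandas on the same data (an assumption made so it runs without a GPU; it is not part of the original report, and pygdf behavior may differ). The names `code` and `matches` are illustrative only.

```python
import pandas as pd

pdf = pd.DataFrame({
    "key": ["a", "b", "c", "d", "e", None, "f", None, "null_key"],
    "value": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})
pdf["key"] = pdf["key"].astype("category")

# The position of a label within the categories index is its integer code.
code = pdf["key"].cat.categories.get_loc("null_key")

# Select rows by comparing codes instead of strings.
# Null entries carry the code -1, so they never match a real category.
matches = pdf[pdf["key"].cat.codes == code]
print(code)
print(matches)
```

Looking the code up this way avoids hard-coding 6, which would silently change if the set of categories changed.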
@mike-wendt mike-wendt added feature request New feature or request 0 - Backlog In queue waiting for assignment labels Aug 6, 2018
@mike-wendt mike-wendt changed the title Allow for strings for comparison in UDFs / fillna for categorical columns Allow for strings for comparison in UDFs / fillna for categorical columns Aug 8, 2018
@kkraus14 kkraus14 added the Python Affects Python cuDF API. label Dec 12, 2018
@Salonijain27

Updating the example code shown above and running it in cudf 0.15:

import pandas as pd
from cudf import DataFrame
pdf = pd.DataFrame({"key": ["a", "b", "c", "d", "e", None, "f", None, "null_key"], "value": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
gdf = DataFrame.from_pandas(pdf)
gdf['key'] = gdf['key'].astype('category')
gdf['key'] = gdf['key'].fillna("f")
gdf.query('key == "f"').head().to_pandas()

The above code produces the following error:

---------------------------------------------------------------------------
TypingError                               Traceback (most recent call last)
<ipython-input-15-de60357f3344> in <module>
      8 gdf['key'] = gdf['key'].astype('category')
      9 gdf['key'] = gdf['key'].fillna("f")
---> 10 gdf.query('key == "f"').head().to_pandas()
     11

~/miniconda3/envs/branch15/lib/python3.8/site-packages/cudf/core/dataframe.py in query(self, expr, local_dict)
   3911             }
   3912             # Run query
-> 3913             boolmask = queryutils.query_execute(self, expr, callenv)
   3914             return self._apply_boolean_mask(boolmask)
   3915

~/miniconda3/envs/branch15/lib/python3.8/site-packages/cudf/utils/queryutils.py in query_execute(df, expr, callenv)
    229     # run kernel
    230     args = [out] + colarrays + envargs
--> 231     kernel.forall(nrows)(*args)
    232     out_mask = applyutils.make_aggregate_nullmask(df, columns=columns)
    233     return out.set_mask(out_mask).fillna(False)

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/cuda/compiler.py in __call__(self, *args)
    338
    339         if isinstance(self.kernel, AutoJitCUDAKernel):
--> 340             kernel = self.kernel.specialize(*args)
    341         else:
    342             kernel = self.kernel

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/cuda/compiler.py in specialize(self, *args)
    842         argtypes = tuple(
    843             [self.typingctx.resolve_argument_type(a) for a in args])
--> 844         kernel = self.compile(argtypes)
    845         return kernel
    846

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/cuda/compiler.py in compile(self, sig)
    857             if 'link' not in self.targetoptions:
    858                 self.targetoptions['link'] = ()
--> 859             kernel = compile_kernel(self.py_func, argtypes,
    860                                     **self.targetoptions)
    861             self.definitions[(cc, argtypes)] = kernel

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/compiler_lock.py in _acquire_compile_lock(*args, **kwargs)
     30         def _acquire_compile_lock(*args, **kwargs):
     31             with self:
---> 32                 return func(*args, **kwargs)
     33         return _acquire_compile_lock
     34

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/cuda/compiler.py in compile_kernel(pyfunc, args, link, debug, inline, fastmath, extensions, max_registers)
     53 def compile_kernel(pyfunc, args, link, debug=False, inline=False,
     54                    fastmath=False, extensions=[], max_registers=None):
---> 55     cres = compile_cuda(pyfunc, types.void, args, debug=debug, inline=inline)
     56     fname = cres.fndesc.llvm_func_name
     57     lib, kernel = cres.target_context.prepare_cuda_kernel(cres.library, fname,

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/compiler_lock.py in _acquire_compile_lock(*args, **kwargs)
     30         def _acquire_compile_lock(*args, **kwargs):
     31             with self:
---> 32                 return func(*args, **kwargs)
     33         return _acquire_compile_lock
     34

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/cuda/compiler.py in compile_cuda(pyfunc, return_type, args, debug, inline)
     36         flags.set('forceinline')
     37     # Run compilation pipeline
---> 38     cres = compiler.compile_extra(typingctx=typingctx,
     39                                   targetctx=targetctx,
     40                                   func=pyfunc,

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/compiler.py in compile_extra(typingctx, targetctx, func, args, return_type, flags, locals, library, pipeline_class)
    601     pipeline = pipeline_class(typingctx, targetctx, library,
    602                               args, return_type, flags, locals)
--> 603     return pipeline.compile_extra(func)
    604
    605

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/compiler.py in compile_extra(self, func)
    337         self.state.lifted = ()
    338         self.state.lifted_from = None
--> 339         return self._compile_bytecode()
    340
    341     def compile_ir(self, func_ir, lifted=(), lifted_from=None):

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/compiler.py in _compile_bytecode(self)
    399         """
    400         assert self.state.func_ir is None
--> 401         return self._compile_core()
    402
    403     def _compile_ir(self):

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/compiler.py in _compile_core(self)
    379                 self.state.status.fail_reason = e
    380                 if is_final_pipeline:
--> 381                     raise e
    382         else:
    383             raise CompilerError("All available pipelines exhausted")

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/compiler.py in _compile_core(self)
    370             res = None
    371             try:
--> 372                 pm.run(self.state)
    373                 if self.state.cr is not None:
    374                     break

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/compiler_machinery.py in run(self, state)
    339                     (self.pipeline_name, pass_desc)
    340                 patched_exception = self._patch_error(msg, e)
--> 341                 raise patched_exception
    342
    343     def dependency_analysis(self):

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/compiler_machinery.py in run(self, state)
    330                 pass_inst = _pass_registry.get(pss).pass_inst
    331                 if isinstance(pass_inst, CompilerPass):
--> 332                     self._runPass(idx, pass_inst, state)
    333                 else:
    334                     raise BaseException("Legacy pass in use")

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/compiler_lock.py in _acquire_compile_lock(*args, **kwargs)
     30         def _acquire_compile_lock(*args, **kwargs):
     31             with self:
---> 32                 return func(*args, **kwargs)
     33         return _acquire_compile_lock
     34

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/compiler_machinery.py in _runPass(self, index, pss, internal_state)
    289             mutated |= check(pss.run_initialization, internal_state)
    290         with SimpleTimer() as pass_time:
--> 291             mutated |= check(pss.run_pass, internal_state)
    292         with SimpleTimer() as finalize_time:
    293             mutated |= check(pss.run_finalizer, internal_state)

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/compiler_machinery.py in check(func, compiler_state)
    262
    263         def check(func, compiler_state):
--> 264             mangled = func(compiler_state)
    265             if mangled not in (True, False):
    266                 msg = ("CompilerPass implementations should return True/False. "

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/typed_passes.py in run_pass(self, state)
     90                               % (state.func_id.func_name,)):
     91             # Type inference
---> 92             typemap, return_type, calltypes = type_inference_stage(
     93                 state.typingctx,
     94                 state.func_ir,

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/typed_passes.py in type_inference_stage(typingctx, interp, args, return_type, locals, raise_errors)
     68
     69         infer.build_constraint()
---> 70         infer.propagate(raise_errors=raise_errors)
     71         typemap, restype, calltypes = infer.unify(raise_errors=raise_errors)
     72

~/miniconda3/envs/branch15/lib/python3.8/site-packages/numba/core/typeinfer.py in propagate(self, raise_errors)
    992                                   if isinstance(e, ForceLiteralArg)]
    993                 if not force_lit_args:
--> 994                     raise errors[0]
    995                 else:
    996                     raise reduce(operator.or_, force_lit_args)

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Internal error at <numba.core.typeinfer.CallConstraint object at 0x7fa088c33430>.
could not get source code
During: resolving callee type: Function(<numba.cuda.compiler.DeviceFunctionTemplate object at 0x7fa0b0183550>)
During: typing of call at <string> (6)

Enable logging at debug level for details.

File "<string>", line 6:
<source missing, REPL/exec in use?>

Furthermore, using:

gdf['key'] = gdf['key'].fillna(6)
gdf.query('key == 6').head().to_pandas()

throws the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/miniconda3/envs/branch15/lib/python3.8/site-packages/cudf/core/column/categorical.py in fillna(self, fill_value)
   1068                 try:
-> 1069                     fill_value = self._encode(fill_value)
   1070                     fill_value = self.codes.dtype.type(fill_value)

~/miniconda3/envs/branch15/lib/python3.8/site-packages/cudf/core/column/categorical.py in _encode(self, value)
    979     def _encode(self, value):
--> 980         return self.categories.find_first_value(value)
    981

~/miniconda3/envs/branch15/lib/python3.8/site-packages/cudf/core/column/string.py in find_first_value(self, value, closest)
   4489     def find_first_value(self, value, closest=False):
-> 4490         return self._find_first_and_last(value)[0]
   4491

~/miniconda3/envs/branch15/lib/python3.8/site-packages/cudf/core/column/string.py in _find_first_and_last(self, value)
   4484         found_indices = libcudf.unary.cast(found_indices, dtype=np.int32)
-> 4485         first = column.as_column(found_indices).find_first_value(1)
   4486         last = column.as_column(found_indices).find_last_value(1)

~/miniconda3/envs/branch15/lib/python3.8/site-packages/cudf/core/column/numerical.py in find_first_value(self, value, closest)
    343         elif found == -1:
--> 344             raise ValueError("value not found")
    345         return found

ValueError: value not found

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-16-f18c30473ccb> in <module>
      7 gdf = DataFrame.from_pandas(pdf)
      8 gdf['key'] = gdf['key'].astype('category')
----> 9 gdf['key'] = gdf['key'].fillna(6)
     10 gdf.query('key == 6').head().to_pandas()
     11

~/miniconda3/envs/branch15/lib/python3.8/site-packages/cudf/core/series.py in fillna(self, value, method, axis, inplace, limit)
   1836             raise NotImplementedError("The axis keyword is not supported")
   1837
-> 1838         data = self._column.fillna(value)
   1839
   1840         if inplace:

~/miniconda3/envs/branch15/lib/python3.8/site-packages/cudf/core/column/categorical.py in fillna(self, fill_value)
   1071                 except (ValueError) as err:
   1072                     err_msg = "fill value must be in categories"
-> 1073                     raise ValueError(err_msg) from err
   1074         else:
   1075             fill_value = column.as_column(fill_value, nan_as_null=False)

ValueError: fill value must be in categories
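
The traceback shows fillna looking the fill value up in the categories, so the integer 6 is rejected because it is not itself a category. A snippet written for the older code-based behavior could translate the code back to its label first. The sketch below illustrates this with plain pandas as a stand-in (an assumption, chosen so it runs without a GPU; whether cudf 0.15 accepts the label in exactly the same way is not confirmed here). `legacy_code` is a hypothetical name for the integer the older snippet passed.

```python
import pandas as pd

s = pd.Series(
    ["a", "b", "c", "d", "e", None, "f", None, "null_key"]
).astype("category")

# Hypothetical: the integer code an older snippet would have passed.
legacy_code = 6

# Translate the code back to the category label it stands for.
label = s.cat.categories[legacy_code]

# Fill nulls with the label, which is a member of the categories.
filled = s.fillna(label)
print(label)
print(filled)
```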

@beckernick
Member

We've implemented fillna for categorical columns.

import cudf
import pandas as pd
s = cudf.Series(["a","b",None,"b"]).astype("category")
s.fillna("a")
0    a
1    b
2    a
3    b
dtype: category
Categories (2, object): ['a', 'b']

If the remaining request here boils down to supporting string types in UDFs, perhaps we might want to close this issue (as it's covered by others, possibly #9639)?

@GregoryKimball
Contributor

GregoryKimball commented Jun 30, 2022

Thanks Nick, I agree. Let's close in favor of #9639
