[BUG] TargetEncoding requires the target columns to exist in a dataset in transform() #1840

gabrielspmoreira · 2023-06-08T01:01:58Z

Describe the bug
When a TargetEncoding op is present in an NVTabular workflow, during fit() NVTabular computes the mean (count, sum) statistics for categorical values with respect to the target column.
However, when this fitted workflow is used to transform() a dataset (for prediction), NVTabular requires the prediction dataset to contain the target columns (the ones we want to predict) and raises the following error.

File "preprocessing.py", line 199, in run
    new_predict_dataset = nvt_workflow_features.transform(predict_dataset)
  File "/usr/lib/python3.8/functools.py", line 912, in _method
    return method.__get__(obj, cls)(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvtabular/workflow/workflow.py", line 115, in _
    return self._transform_impl(dataset)
  File "/usr/local/lib/python3.8/dist-packages/nvtabular/workflow/workflow.py", line 271, in _transform_impl
    ddf = dataset.to_ddf(columns=self._input_columns())
  File "/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py", line 401, in to_ddf
    ddf = self.engine.to_ddf(columns=columns)
  File "/usr/local/lib/python3.8/dist-packages/merlin/io/dataframe_engine.py", line 44, in to_ddf
    return _ddf[columns]
  File "/usr/local/lib/python3.8/dist-packages/dask/dataframe/core.py", line 4648, in __getitem__
    meta = self._meta[_extract_meta(key)]
  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py", line 1169, in __getitem__
    return self._get_columns_by_label(mask)
  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py", line 1893, in _get_columns_by_label
    new_data = super()._get_columns_by_label(labels, downcast)
  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 418, in _get_columns_by_label
    return self._data.select_by_label(labels)
  File "/usr/local/lib/python3.8/dist-packages/cudf/core/column_accessor.py", line 338, in select_by_label
    return self._select_by_label_list_like(key)
  File "/usr/local/lib/python3.8/dist-packages/cudf/core/column_accessor.py", line 453, in _select_by_label_list_like
    data = {k: self._grouped_data[k] for k in key}
  File "/usr/local/lib/python3.8/dist-packages/cudf/core/column_accessor.py", line 453, in <dictcomp>
    data = {k: self._grouped_data[k] for k in key}
KeyError: 'is_installed'

The target encoded values should be retrieved from statistics computed in fit() and target columns shouldn't be required.
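
To illustrate the behavior I would expect, here is a minimal pure-Python sketch, not NVTabular's actual implementation. The helper names fit_target_stats and transform_with_stats are hypothetical, and the smoothing formula (sum + p_smooth * global_mean) / (count + p_smooth) is an assumption about how the smoothed mean is computed; k-fold handling is left out. The point is only that the transform step reads the fitted statistics and never touches the target column.

```python
from collections import defaultdict


def fit_target_stats(rows, cat_col, target_col):
    """Collect per-category [count, sum] and the global mean of the target."""
    stats = defaultdict(lambda: [0, 0.0])
    total = 0.0
    for row in rows:
        entry = stats[row[cat_col]]
        entry[0] += 1
        entry[1] += row[target_col]
        total += row[target_col]
    global_mean = total / len(rows)
    return dict(stats), global_mean


def transform_with_stats(rows, cat_col, stats, global_mean, p_smooth=10):
    """Encode rows from the fitted statistics only; no target column is read."""
    encoded = []
    for row in rows:
        # Unseen categories fall back to count=0, sum=0, i.e. the global mean.
        count, total = stats.get(row[cat_col], (0, 0.0))
        encoded.append((total + p_smooth * global_mean) / (count + p_smooth))
    return encoded


# Training rows contain the target; prediction rows do not.
train = [
    {"feature1": "a", "is_installed": 1},
    {"feature1": "a", "is_installed": 0},
    {"feature1": "b", "is_installed": 1},
    {"feature1": "b", "is_installed": 1},
]
predict = [{"feature1": "a"}, {"feature1": "b"}, {"feature1": "c"}]

stats, global_mean = fit_target_stats(train, "feature1", "is_installed")
encoded = transform_with_stats(predict, "feature1", stats, global_mean, p_smooth=10)
```

In this sketch the prediction rows carry only feature1, and the category "c", which was never seen during fit, simply receives the global mean.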

P.S. As a workaround, I have been creating dummy target columns in the prediction dataset to avoid that error in transform().
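
The workaround looks roughly like the sketch below. It uses pandas for illustration (the same idea should apply to a cuDF dataframe), and the helper name add_dummy_targets and the fill value 0 are my own choices, not part of NVTabular.

```python
import pandas as pd


def add_dummy_targets(df, target_columns, fill_value=0):
    """Return a copy of df with placeholder columns for any missing targets."""
    out = df.copy()
    for col in target_columns:
        if col not in out.columns:
            out[col] = fill_value
    return out


# Prediction data has the features but not the target column.
predict_df = pd.DataFrame({"feature1": ["a", "b"], "feature2": ["x", "y"]})
padded_df = add_dummy_targets(predict_df, ["is_installed"])
```

The padded dataframe can then be wrapped in nvt.Dataset(padded_df) and passed to transform(), so the column selection in to_ddf() no longer hits the KeyError.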

Steps/Code to reproduce bug

Create an NVTabular workflow that includes a target encoded feature

import nvtabular as nvt

# Target-encode feature1 and feature2 against the is_installed target
target_encoding = (
    ["feature1", "feature2"]
    >> nvt.ops.TargetEncoding(
        ["is_installed"],
        kfold=5,
        p_smooth=10,
        out_dtype="float32",
    )
)

Run workflow.fit() with a dataset that contains feature1, feature2, and is_installed
Run workflow.transform() on a dataset that contains feature1 and feature2, but NOT is_installed

Expected behavior

Target columns should not be required by transform() in workflows that include TargetEncoding.

rnyak · 2023-06-08T01:48:46Z

similar ticket: #1582

gabrielspmoreira added the bug label Jun 8, 2023

gabrielspmoreira added this to the Merlin 23.06 milestone Jun 8, 2023

gabrielspmoreira added the P1 label Jun 14, 2023

gabrielspmoreira modified the milestones: Merlin 23.06, Merlin 23.07 Jun 14, 2023
