Describe the bug
When the TargetEncoding op is used with multiple target columns, it might switch the contents of the target columns.
Furthermore, the internal statistics (count, sum) saved with the NVT workflow for the target columns are also switched.
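To make the first symptom concrete, here is a minimal, library-free sketch of how one could confirm that the two target columns were exchanged rather than corrupted in some other way. It assumes the rows carry a unique key column (f_0 is used here) and the two targets named in this report; the function name find_swapped_rows and the sample rows are hypothetical and only serve the illustration.

```python
def find_swapped_rows(raw, processed, key="f_0",
                      targets=("is_clicked", "is_installed")):
    """Return the keys whose target pair in `processed` is the raw pair reversed."""
    first, second = targets
    # Index the processed rows by key so row order does not matter.
    processed_by_key = {row[key]: row for row in processed}
    swapped = []
    for row in raw:
        other = processed_by_key.get(row[key])
        if other is None:
            continue  # key missing from the processed data
        # Rows whose two targets are equal cannot reveal a swap, so skip them.
        if row[first] == row[second]:
            continue
        if (other[first], other[second]) == (row[second], row[first]):
            swapped.append(row[key])
    return swapped


# Made-up raw rows (assumption: not taken from the real dataset).
raw_rows = [
    {"f_0": 1, "is_clicked": 1, "is_installed": 0},
    {"f_0": 2, "is_clicked": 1, "is_installed": 1},
    {"f_0": 3, "is_clicked": 0, "is_installed": 0},
    {"f_0": 4, "is_clicked": 1, "is_installed": 0},
]
# Hypothetical preprocessed rows in which the two targets were exchanged.
processed_rows = [
    {"f_0": 4, "is_clicked": 0, "is_installed": 1},
    {"f_0": 1, "is_clicked": 0, "is_installed": 1},
    {"f_0": 2, "is_clicked": 1, "is_installed": 1},
    {"f_0": 3, "is_clicked": 0, "is_installed": 0},
]

swapped_keys = find_swapped_rows(raw_rows, processed_rows)
print(swapped_keys)
```

If every row whose two targets differ shows up in the returned list, the columns were most likely exchanged wholesale. An empty list means the pairs match the raw data or differ in some other way.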
Steps/Code to reproduce bug
Go to the Quick-start for ranking example folder: cd /Merlin/examples/quick_start/scripts/preproc
Edit preprocessing.py to add a TargetEncoding op inside the generate_nvt_workflow_features() method, like this. Notice that it includes multiple target columns: ["is_clicked", "is_installed"]
Inspect the output preprocessed parquet files in /outputs/shrcht_preproc_01_te/train. You will notice that the values of the targets is_clicked and is_installed are now switched compared to the original raw data. P.S. You can use the column f_0 as the primary key to find the corresponding rows in the raw train dataset and the preprocessed dataset.
Now inspect the NVT workflow statistics parquet file for target encoding, found in /outputs/shrcht_preproc_01_te/workflow/categories/cat_stats.__fold___f_6.parquet. That file contains the count and sum of each categorical value of f_6 with respect to the targets. If you compute those statistics manually from the raw data (e.g. using something like ddf.groupby('f_6')[["is_clicked", "is_installed"]].agg("sum")), you will notice that the sums of the targets are switched compared to the raw data (i.e. the sum of positive "is_installed" events will be higher than the sum of positive "is_clicked" events, which is typically not the real scenario).
Now change the preprocessing.py script again and split that TargetEncoding op into two ops, one for each target, like this
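The manual statistics check from the reproduction steps can also be sketched without NVTabular or Dask. This is not the preprocessing.py change itself, only a rough illustration of the comparison: it computes the per-category count and per-target sum for f_6 from a few made-up rows, then tests whether a second set of statistics has the two sums exchanged. The helper names category_stats and sums_look_swapped, the sample rows, and the layout of the observed dictionary are all assumptions; the column layout of the real cat_stats parquet file is not assumed here.

```python
from collections import defaultdict


def category_stats(rows, cat_col="f_6",
                   targets=("is_clicked", "is_installed")):
    """Per-category row count and per-target sum, keyed by target name."""
    stats = defaultdict(lambda: {"count": 0, **{t: 0 for t in targets}})
    for row in rows:
        entry = stats[row[cat_col]]
        entry["count"] += 1
        for target in targets:
            entry[target] += row[target]
    return dict(stats)


def sums_look_swapped(expected, observed,
                      targets=("is_clicked", "is_installed")):
    """True if, for every category where the two sums differ, they are exchanged."""
    first, second = targets
    # Categories with equal sums carry no information about a swap.
    differing = [c for c in expected if expected[c][first] != expected[c][second]]
    return bool(differing) and all(
        c in observed
        and observed[c][first] == expected[c][second]
        and observed[c][second] == expected[c][first]
        for c in differing
    )


# Made-up raw rows (assumption: not taken from the real dataset).
raw_rows = [
    {"f_6": "a", "is_clicked": 1, "is_installed": 0},
    {"f_6": "a", "is_clicked": 1, "is_installed": 1},
    {"f_6": "b", "is_clicked": 0, "is_installed": 0},
    {"f_6": "b", "is_clicked": 1, "is_installed": 0},
]
expected_stats = category_stats(raw_rows)

# Hypothetical statistics, as they might be read from a saved workflow,
# in which the sums of the two targets are exchanged.
observed_stats = {
    "a": {"count": 2, "is_clicked": 1, "is_installed": 2},
    "b": {"count": 2, "is_clicked": 0, "is_installed": 1},
}

print(expected_stats)
print(sums_look_swapped(expected_stats, observed_stats))
```

Keying every sum by its target name, rather than by position, keeps the association between a statistic and its column explicit, which is the property the report says is lost.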
Expected behavior
TargetEncoding should not switch the values of the target columns or of the target-encoded features.
Environment details (please complete the following information):