
[BUG] ops.Categorify frequency hashing raises RuntimeError when the dataset is shuffled by keys #1864

Open
piojanu opened this issue Sep 20, 2023 · 0 comments
Labels
bug Something isn't working

piojanu commented Sep 20, 2023

Describe the bug
ops.Categorify raises ValueError: Column must have no nulls. when num_buckets > 1 and the dataset is shuffled by keys. EDIT: the full error message is here: https://pastebin.com/GJRQhxAi
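For context, frequency hashing in Categorify combines a frequency threshold with bucket hashing: values that occur at least freq_threshold times get their own category ids, while rarer values are hashed into one of num_buckets buckets. A minimal pandas/NumPy sketch of that idea (illustrative only, not NVTabular's actual implementation; the function name is made up):

```python
import pandas as pd

def freq_hash_encode(series, freq_threshold, num_buckets):
    """Sketch: keep frequent values as distinct categories,
    hash infrequent values into num_buckets hash buckets."""
    counts = series.value_counts()
    vocab = counts[counts >= freq_threshold].index
    # Frequent values get ids offset past the hash-bucket range.
    mapping = {v: num_buckets + i for i, v in enumerate(vocab)}
    return series.map(lambda v: mapping.get(v, hash(v) % num_buckets)).astype("int64")

s = pd.Series([1, 1, 1, 2, 2, 2, 7, 9])
encoded = freq_hash_encode(s, freq_threshold=3, num_buckets=2)
# 1 and 2 appear >= 3 times -> distinct ids; 7 and 9 are hashed into 2 buckets.
```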

Steps/Code to reproduce bug

import gc

import dask.dataframe as dd
import numpy as np
import pandas as pd

import nvtabular as nvt

# Generate synthetic data
N_ROWS = 100_000_000
CHUNK_SIZE = 10_000_000

N = N_ROWS // CHUNK_SIZE
dataframes = []
for i in range(N):
    print(f"{i+1}/{N}")
    chunk_data = np.random.lognormal(3., 10., CHUNK_SIZE).astype(np.int32)
    chunk_df = pd.DataFrame({'session_id': chunk_data // 45, 'item_id': chunk_data})
    chunk_ddf = dd.from_pandas(chunk_df, npartitions=1)
    dataframes.append(chunk_ddf)

ddf = dd.concat(dataframes, axis=0)
del dataframes
gc.collect()

# !!! When `shuffle_by_keys` is commented out, the code finishes successfully
dataset = nvt.Dataset(ddf).shuffle_by_keys(keys=["session_id"])

_categorical_feats = [
    "item_id",
] >> nvt.ops.Categorify(
    freq_threshold=5,
    # !!! When `num_buckets=None`, the code finishes successfully
    num_buckets=100,
)

workflow = nvt.Workflow(_categorical_feats)
workflow.fit(dataset)
workflow.output_schema

Expected behavior
ops.Categorify fits successfully when num_buckets > 1 and the dataset is shuffled by keys.

Environment details (please complete the following information):

  • Environment location: JupyterLab in Docker on GCP
  • Method of NVTabular install: Docker

My Dockerfile:

# AFTER https://github.com/GoogleCloudPlatform/nvidia-merlin-on-vertex-ai
FROM nvcr.io/nvidia/merlin/merlin-pytorch:23.08

# Install Google Cloud SDK
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" \
        | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list \
    && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg \
        | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - \
    && apt-get update -y \
    && apt-get install google-cloud-sdk -y

# Copy your project to the Docker image
COPY . /project
WORKDIR /project

# Install Python dependencies
RUN pip install -U pip
RUN pip install -r requirements/base.txt

# Run Jupyter Lab by default, with no authentication, on port 8080
EXPOSE 8080
CMD ["jupyter-lab", "--allow-root", "--ip=0.0.0.0", "--port=8080", "--no-browser", "--NotebookApp.token=''", "--NotebookApp.allow_origin='*'"]

Additional context
I need to call shuffle_by_keys because I apply a Groupby operation afterwards.
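The reason the shuffle matters for the groupby: aggregating per partition is only equivalent to a global aggregation when every key lives in exactly one partition, which is the invariant shuffle_by_keys provides. A pandas-only sketch of that invariant (illustrative, not NVTabular or Dask internals):

```python
import pandas as pd

# Two unshuffled partitions where session 1 is split across both.
parts = [
    pd.DataFrame({"session_id": [1, 2], "item_id": [10, 20]}),
    pd.DataFrame({"session_id": [1, 3], "item_id": [11, 30]}),
]

def shuffle_by_key(parts, key, npartitions):
    """Reassign every row to a partition chosen by hashing its key,
    so all rows sharing a key end up in the same partition."""
    df = pd.concat(parts, ignore_index=True)
    assignment = df[key].map(lambda k: hash(k) % npartitions)
    return [df[assignment == p] for p in range(npartitions)]

shuffled = shuffle_by_key(parts, "session_id", npartitions=2)

# After the shuffle, a per-partition groupby matches a global groupby.
per_partition = pd.concat(p.groupby("session_id")["item_id"].sum() for p in shuffled)
global_agg = pd.concat(parts).groupby("session_id")["item_id"].sum()
```

Running the same per-partition aggregation on the original, unshuffled parts would instead produce two separate rows for session 1, which is exactly what the shuffle prevents.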

@piojanu piojanu added the bug Something isn't working label Sep 20, 2023
@piojanu piojanu changed the title [BUG] ops.Categorify frequency hashing rises RuntimeError when the dataset is shuffled by keys [BUG] ops.Categorify frequency hashing raises RuntimeError when the dataset is shuffled by keys Sep 20, 2023