[BUG] PyTorch NVT Data Loader - Not possible to know the column names of non-list columns #499

Closed
gabrielspmoreira opened this issue Dec 16, 2020 · 0 comments · Fixed by #793

Describe the bug
When the PyTorch data loader reads a parquet file that contains both list columns and "simple" (non-list) columns, there is no way to tell which column names correspond to the "simple" features.

Steps/Code to reproduce bug

Read a parquet file whose categorical columns include both list columns and simple columns (the same problem applies to continuous columns):

from nvtabular.loader.torch import TorchAsyncItr as NVTDataLoader
from nvtabular import Dataset as NVTDataset

data_loader_config = {
    "cats": ['simple_col1', 'list_col2', 'simple_col3'],
    "conts": [],
    "labels": [],
}

# train_data_path points to the parquet file(s) to load
train_set = NVTDataset(train_data_path, engine="parquet", part_mem_fraction=0.1)
train_loader = NVTDataLoader(train_set, batch_size=10,
                             shuffle=False, **data_loader_config)


cat_features, cont_features, label_features = train_loader.__next__()
cat_single_features, cat_sequence_features = cat_features

cat_sequence_features is a dictionary with the key list_col2 and the corresponding tensor as its value. cat_single_features is a 2-dimensional tensor with one column for simple_col1 and another for simple_col3.
The problem is that the data loader provides no way to know which column name corresponds to each column of cat_single_features.
I have checked train_set.cat_names and train_set.cont_names, but they do not describe only cat_single_features, because they also contain the list column names.
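
A possible interim workaround is to derive the names by hand. The minimal sketch below assumes that the columns of cat_single_features follow the order of the "cats" list with the list columns removed. That ordering is an assumption, not something the loader is known to guarantee, and single_column_index is a hypothetical helper, not part of NVTabular.

```python
def single_column_index(cat_names, list_names):
    """Map each non-list column name to its ASSUMED column position.

    Assumption: the 2-D tensor keeps the order of cat_names after the
    list columns have been dropped. Verify this against real data.
    """
    list_names = set(list_names)
    singles = [name for name in cat_names if name not in list_names]
    return {name: position for position, name in enumerate(singles)}


cats = ["simple_col1", "list_col2", "simple_col3"]
# In a real loop the list-column names could be taken from the keys
# of cat_sequence_features.
index = single_column_index(cats, ["list_col2"])
print(index)  # {'simple_col1': 0, 'simple_col3': 1}
```

With such a mapping, a single feature could be selected as cat_single_features[:, index['simple_col3']], provided the ordering assumption holds.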

Expected behavior
It would be better if cat_single_features could also be a dict of tensors. If doing so carries a significant performance penalty, the data loader should at least provide a property listing the cat_names that correspond to the columns of the cat_single_features tensor.
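
To illustrate the requested shape, here is a rough sketch of splitting a 2-D batch into a dict keyed by column name. Plain nested lists stand in for the tensor so the example runs without PyTorch, and columns_as_dict is a hypothetical name rather than an existing NVTabular function. With a real tensor, each value would be a column slice instead of a Python list.

```python
def columns_as_dict(batch_rows, names):
    """Split a row-major 2-D batch into {column name: list of values}.

    batch_rows is a stand-in for the 2-D cat_single_features tensor;
    names must list the columns in the same order as the batch.
    """
    if any(len(row) != len(names) for row in batch_rows):
        raise ValueError("every row must have one value per column name")
    return {
        name: [row[position] for row in batch_rows]
        for position, name in enumerate(names)
    }


# Three rows, two simple categorical columns (made-up values).
batch = [[3, 7], [1, 9], [4, 2]]
features = columns_as_dict(batch, ["simple_col1", "simple_col3"])
print(features["simple_col3"])  # [7, 9, 2]
```

Returning a dict keeps the single-column features symmetric with cat_sequence_features, which is already keyed by column name.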

Environment details (please complete the following information):

  • NVTabular 0.3