[BUG] PyTorch NVT Data Loader - Not possible to know the column names of non-list columns #499

gabrielspmoreira · 2020-12-16T02:34:55Z

Describe the bug
When the PyTorch data loader reads a parquet file that has both list columns and "simple" (non-list) columns, there is no way to find out the column names of the "simple" features.

Steps/Code to reproduce bug

Read a parquet file that has both list columns and simple columns among its categorical columns (the same problem occurs with continuous columns).

from nvtabular.loader.torch import TorchAsyncItr as NVTDataLoader
from nvtabular import Dataset as NVTDataset

# Column roles passed to the data loader; list_col2 is a list column
data_loader_config = {
    "cats": ["simple_col1", "list_col2", "simple_col3"],
    "conts": [],
    "labels": [],
}

# train_data_path points to the parquet file described above
train_set = NVTDataset(train_data_path, engine="parquet", part_mem_fraction=0.1)
train_loader = NVTDataLoader(train_set, batch_size=10, shuffle=False, **data_loader_config)

# Fetch the first batch and split the categorical features
cat_features, cont_features, label_features = next(iter(train_loader))
cat_single_features, cat_sequence_features = cat_features

cat_sequence_features is a dictionary with the key list_col2 and the corresponding tensor as its value. cat_single_features is a 2-dimensional tensor with one column for simple_col1 and another for simple_col3.
The problem is that the data loader provides no way to find out which column name corresponds to which column of cat_single_features.
I have checked train_set.cat_names and train_set.cont_names, but they do not describe only `cat_single_features`, because they also contain the list column names.

Expected behavior
It would be better if cat_single_features were also a dict of tensors. If doing so carries a significant performance penalty, the data loader should instead expose a property listing the cat_names that correspond to the columns of the cat_single_features tensor.
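Until something like this exists, a possible workaround is to derive the names on the caller's side. The sketch below is only an illustration: the helpers `single_hot_column_names` and `name_columns` are hypothetical (they are not part of NVTabular), and it assumes that the columns of cat_single_features follow the order of the "cats" list with the list columns removed, which the data loader does not guarantee or document.

```python
def single_hot_column_names(cat_names, list_col_names):
    """Return the names in cat_names that are not list columns, keeping their order.

    Hypothetical helper, not an NVTabular API. The caller has to know which
    columns are list columns, e.g. from the keys of cat_sequence_features.
    """
    list_cols = set(list_col_names)
    return [name for name in cat_names if name not in list_cols]


def name_columns(single_features, names):
    """Map each name to one column of a 2-D (batch, n_columns) tensor-like object.

    ASSUMPTION: column i of single_features corresponds to names[i].
    """
    return {name: single_features[:, i] for i, name in enumerate(names)}


cats = ["simple_col1", "list_col2", "simple_col3"]
# In the reproduction above, the list column names would come from
# cat_sequence_features.keys(); they are hard-coded here for illustration.
single_names = single_hot_column_names(cats, ["list_col2"])
```

With the batch from the reproduction code, `name_columns(cat_single_features, single_names)` would then give a dict keyed by simple_col1 and simple_col3, provided the ordering assumption holds.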

Environment details (please complete the following information):

NVTabular 0.3


gabrielspmoreira added the bug label Dec 16, 2020

gabrielspmoreira mentioned this issue Dec 16, 2020

[FEA] Abstract PyTorch Data loader to provide Sparse Tensors for list columns #500

Closed

gabrielspmoreira mentioned this issue Mar 4, 2021

[FEA] Session-based recommendation support #355

Closed

viswa-nvidia added this to the NVTabular v0.6 milestone Apr 26, 2021

benfred assigned jperez999 May 12, 2021

benfred added the PyTorch label and removed the bug label May 12, 2021

jperez999 linked a pull request May 24, 2021 that will close this issue

Add Sparse Representation Capability to Dataloaders #793

Merged

jperez999 closed this as completed in #793 Jun 1, 2021
