Imputation with NaNs causes the loss to be NaN #41
I think this has been fixed with this new JSON config, and by making sure to represent the missing values as np.nan.
Great to see that you fixed your issue. Did you use …
Hi George,
I greatly appreciated your detailed email. I will update my methodology and check it out today.
Thanks,
Jag
________________________________
From: George Zerveas ***@***.***>
Sent: April 5, 2023 9:05 PM
To: gzerveas/mvts_transformer ***@***.***>
Cc: watersoup ***@***.***>; Author ***@***.***>
Subject: Re: [gzerveas/mvts_transformer] Imputation with nan's produces loss to be nan (Issue #41)
Ok, let me clarify a couple of things:
When training, you need to have complete data (no missing values), and you simply train with the self-supervised imputation objective. During inference, you must know the indices of the missing values (i.e. which values are actually missing), but what values you use as fillers is up to you. You only need to ensure that the filler values between training and inference are consistent.
You therefore don't need NaN values to represent missingness (and they can easily cause trouble). Missingness is represented by the boolean mask that corresponds to the missing indices. This mask is produced by noise_mask within the __getitem__ of ImputationDataset during training, but during inference it must be provided by you, i.e. by your class in data.py, and then passed on to a Dataset class. That class can be a subclass of ImputationDataset, almost identical to the original, which simply gets the masks directly from your data.py class instead of calling noise_mask to generate them.

These masks are then used within the [collate function](https://github.com/gzerveas/mvts_transformer/blob/3f2e378bc77d02e82a44671f20cf15bc7761671a/src/datasets/dataset.py#L210) to define the target_masks (you don't need to change these), and to transform the [X features tensor](https://github.com/gzerveas/mvts_transformer/blob/3f2e378bc77d02e82a44671f20cf15bc7761671a/src/datasets/dataset.py#L225). Currently, this transformation enters zeros where the values are missing (these are the zeros of the boolean mask). However, if you think zeros won't work for you (I suggest you first give it a try), you can set the corresponding elements of X to an arbitrary value outside the range of your features, e.g. -77. Again, just make sure that you do this consistently both for training and for inference; this simply means changing [this line](https://github.com/gzerveas/mvts_transformer/blob/3f2e378bc77d02e82a44671f20cf15bc7761671a/src/datasets/dataset.py#L225) within collate_unsuperv.
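The approach described above (precomputed masks plus a consistent filler value) can be sketched as follows. This is a hedged, self-contained illustration, not the repository's actual code: PrecomputedMaskDataset and its fields are hypothetical names standing in for a subclass of ImputationDataset that receives masks from your data class instead of generating them with noise_mask.

```python
import numpy as np

class PrecomputedMaskDataset:
    """Hypothetical sketch of an ImputationDataset variant that returns
    boolean masks supplied by the data loader instead of calling noise_mask.
    Convention, as described above: True = value present, False = missing."""

    def __init__(self, features, masks, filler=0.0):
        self.features = features  # list of (seq_len, feat_dim) float arrays
        self.masks = masks        # list of same-shape boolean arrays
        self.filler = filler      # must match the value used during training

    def __getitem__(self, idx):
        X = self.features[idx].copy()
        mask = self.masks[idx]
        X[~mask] = self.filler    # fill missing positions consistently
        target_mask = ~mask       # loss is computed only at missing positions
        return X, target_mask

# Usage with a filler outside the feature range, e.g. -77:
feats = [np.array([[1.0, 2.0], [3.0, 4.0]])]
masks = [np.array([[True, False], [True, True]])]
ds = PrecomputedMaskDataset(feats, masks, filler=-77.0)
X, tm = ds[0]
# X[0, 1] is now -77.0, and tm[0, 1] is True (loss computed there)
```

The key design point is that the filler value lives in one place, so training and inference cannot drift apart.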
Hi George,
Imputation using this package is a bit confusing. I tried keeping NaNs for the values to be imputed, but my loss obviously becomes NaN. I cannot use 0, because 0 can be a legitimate value. I tried -1, but it seems to throw a ZeroDivisionError after the 1st epoch.
I can give you my data, i.e. my wastewater data class.
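The NaN loss reported here is the expected behavior of floating-point arithmetic, not a bug specific to this package. A minimal sketch (with made-up numbers) shows that NaN fillers poison even a masked MSE, because multiplying NaN by a zero mask weight still yields NaN:

```python
import numpy as np

pred = np.array([1.0, 2.0, 3.0])
target = np.array([1.0, np.nan, 3.0])  # NaN marks the "missing" value
mask = np.array([1.0, 0.0, 1.0])       # 0 is meant to exclude the NaN position

# (2.0 - nan)**2 is nan, and nan * 0.0 is still nan, so the mean is nan
loss = np.mean(((pred - target) ** 2) * mask)
print(np.isnan(loss))  # True
```

This is why the maintainer's advice below is to represent missingness with a boolean mask and use an ordinary finite filler value instead of NaN.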
Here is my JSON config file (backslashes in the Windows paths escaped so the file parses):

```json
{
    "data_dir": "\\\\Hscpigdcapmdw05\\sas\\Use....\\inputdata",
    "output_dir": "\\\\Hscpigdcapmdw05\\sas\\Use...\\mvts_imputed",
    "model": "transformer",
    "data_class": "wastewater",
    "task": "imputation",
    "d_model": 64,
    "activation": "relu",
    "num_heads": 4,
    "num_layers": 8,
    "pos_encoding": "learnable",
    "epochs": 10,
    "normalization": "minmax",
    "test_ratio": 0.1,
    "val_ratio": 0.05,
    "mean_mask_length": 6,
    "mask_mode": "concurrent",
    "mask_distribution": "bernoulli",
    "exclude_feats": ["geoLat", "geoLong", "phureg", "sewershedPop"],
    "data_window_len": 15,
    "lr": 0.001,
    "batch_size": 5,
    "masking_ratio": 0.05
}
```
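One practical gotcha with this config is the Windows UNC paths: inside JSON strings, every literal backslash must be escaped, so a leading `\\server` has to be written `\\\\server`. A small check (with a shortened illustrative path, not the actual one from the config) demonstrates both the failure and the fix:

```python
import json

# "\s" is not a valid JSON escape, so the single-backslash form fails to parse
try:
    json.loads(r'{"data_dir": "\\Hscpigdcapmdw05\sas\inputdata"}')
    ok = True
except json.JSONDecodeError:
    ok = False
# ok is False here

# Doubling every backslash produces valid JSON that decodes to the intended
# path \\Hscpigdcapmdw05\sas\inputdata
cfg = json.loads(r'{"data_dir": "\\\\Hscpigdcapmdw05\\sas\\inputdata"}')
```

Passing the raw path through json.loads like this is a quick way to verify a config file before a full training run.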