Imputation with NaNs produces NaN loss #41

Open
watersoup opened this issue Feb 28, 2023 · 6 comments

@watersoup

Hi George,
Imputation using this package is a bit confusing. I tried keeping NaNs for the values to be imputed, but then my loss is obviously NaN. I cannot use 0 because it can be a legitimate value. I tried -1, but it seems to throw a ZeroDivisionError after the 1st epoch.
I can give you my data, i.e. the wastewater class.
Here is my JSON config file:
{
  "data_dir": "\\\\Hscpigdcapmdw05\\sas\\Use....\\inputdata",
  "output_dir": "\\\\Hscpigdcapmdw05\\sas\\Use...\\mvts_imputed",
  "model": "transformer",
  "data_class": "wastewater",
  "task": "imputation",
  "d_model": 64,
  "activation": "relu",
  "num_heads": 4,
  "num_layers": 8,
  "pos_encoding": "learnable",
  "epochs": 10,
  "normalization": "minmax",
  "test_ratio": 0.1,
  "val_ratio": 0.05,
  "mean_mask_length": 6,
  "mask_mode": "concurrent",
  "mask_distribution": "bernoulli",
  "exclude_feats": ["geoLat", "geoLong", "phureg", "sewershedPop"],
  "data_window_len": 15,
  "lr": 0.001,
  "batch_size": 5,
  "masking_ratio": 0.05
}

@watersoup (Author)

I think this has been fixed with the new JSON config below, and by making sure the missing values are stored as np.nan.
{
  "data_dir": "../inputdata",
  "output_dir": "../mvts_imputed",
  "model": "transformer",
  "data_class": "wastewater",
  "records_file": "../mvts_imputed/ImputationRecords.xls",
  "print_interval": 10,

  "mean_mask_length": 6,
  "mask_mode": "concurrent",
  "mask_distribution": "bernoulli",

  "task": "imputation",
  "no_timestamp": 1,
  "d_model": 128,
  "activation": "relu",
  "num_heads": 8,
  "num_layers": 8,
  "pos_encoding": "learnable",
  "epochs": 200,
  "normalization": "minmax",

  "val_ratio": 0.2,
  "val_interval": 5,

  "exclude_feats": ["geoLat", "geoLong", "phureg", "sewershedPop"],
  "data_window_len": 15,
  "batch_size": 10
}
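
For reference, a minimal sketch of what "making sure the missing values are stored as np.nan" can look like before the data reaches the data class. The file name, column name, and -1 sentinel here are hypothetical examples, not part of the repo:

import numpy as np
import pandas as pd

# Hypothetical input file and column; adapt to your own wastewater data.
df = pd.read_csv("../inputdata/wastewater.csv")

# Replace custom missing-value sentinels (here -1) with np.nan. Empty CSV
# fields are already parsed as NaN by pandas, so only explicit sentinels
# need this step.
df["viral_load"] = df["viral_load"].replace(-1, np.nan)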

@gzerveas (Owner)

Great to see that you fixed your issue. Did you use np.nan as a masking value?

@watersoup (Author)

> Great to see that you fixed your issue. Did you use np.nan as a masking value?

Hi George,
Yes, I had to use np.nan for the missing values in the dataset, and I increased the validation ratio from 5% to 20%.
Jag

@watersoup (Author)

Hi George,
It seems that np.nan is not actually working after all; I am not sure why it worked earlier. Is there any other way to make the missing values work? Or can you tell me where and how the missing entries in the training data can be dropped during training?
Thanks
Jag

@gzerveas (Owner)

gzerveas commented Apr 5, 2023

Ok, let me clarify a couple of things:
When training, you need to have complete data (no missing values), and you simply train with the self-supervised imputation objective. During inference, you must know the indices of the missing values (i.e. which values are actually missing), but what values you use as fillers is up to you. You only need to ensure that the filler values are consistent between training and inference.

You therefore don't need NaN values to represent missingness (and they can easily cause trouble). Missingness is represented by the boolean mask, a (seq_length, feat_dim) array for each sample, which corresponds to the missing indices. This mask is produced by noise_mask within the __getitem__ of ImputationDataset during training, but during inference it must be provided by you, i.e. by your data.py class, and then passed on to a Dataset class that can be a subclass of ImputationDataset, almost identical to the original, except that it gets the masks directly from your data.py class instead of calling noise_mask to generate them on the fly. These masks are then used within the collate function to define the target_masks (you don't need to change these), and to transform the X features tensor. Currently, this transformation enters zeroes where the values are missing (these are the zeroes of the boolean mask). However, if you think zeroes won't work for you (I suggest you first give it a try), you can set the corresponding elements of X to an arbitrary value outside the range of your features, e.g. -77. Again, just make sure that you do this consistently both for training and for inference; this simply means changing this line within collate_unsuperv.
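
To make that concrete, here is a rough sketch of such a subclass. The feature_df, data, and IDs attributes are assumed to match what ImputationDataset already sets up, and missing_masks is a hypothetical attribute you would add to your data.py class; none of this is the repo's exact API.

# Rough sketch only. Assumes ImputationDataset sets up self.feature_df,
# self.data, and self.IDs; `missing_masks` is a hypothetical dict you would
# populate in your data.py class, mapping sample ID to a boolean
# (seq_length, feat_dim) array with 0 at the missing positions.
import torch
from datasets.dataset import ImputationDataset

class PrecomputedMaskImputationDataset(ImputationDataset):
    def __getitem__(self, ind):
        ID = self.IDs[ind]
        X = self.feature_df.loc[ID].values   # (seq_length, feat_dim) array
        mask = self.data.missing_masks[ID]   # precomputed, instead of noise_mask()
        return torch.from_numpy(X), torch.from_numpy(mask), ID

And if zeroes clash with legitimate feature values, the filler change inside collate_unsuperv would look roughly like this (the exact variable names in the repo may differ):

X = X * target_masks     # current behavior: missing positions become 0
# ... or, for a sentinel filler outside the feature range:
X[~target_masks] = -77

Whichever filler you pick, keep it identical between training and inference.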

@watersoup (Author)

watersoup commented Apr 14, 2023 via email
