I have been able to reproduce the problem when the dataset has a large number of unique categorical values, like yours. Here is a screen capture of the memory usage while sampling in such a scenario, which matches what you are describing.
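For intuition, here is a rough back-of-the-envelope of why one_hot_encoding blows up with high-cardinality columns; the row count and cardinality below are illustrative only, not taken from this thread:

```python
# Illustrative numbers only: with one_hot_encoding, every unique categorical
# value becomes its own float column, so the transformed matrix grows with
# rows * unique values.
n_rows = 5000        # the sample size that fails for the reporter
n_unique = 50000     # hypothetical cardinality of a single categorical column
bytes_per_value = 8  # numpy float64
gib = n_rows * n_unique * bytes_per_value / 2**30
print(f"~{gib:.1f} GiB just for the one-hot block of that one column")
```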
We are working on a fix for this in RDT to reduce the memory usage (sdv-dev/RDT#156), but in the meantime I recommend changing the categorical transformer to categorical instead of the default one_hot_encoding.
This may slightly reduce how well the model learns the correlations involving some of the categorical columns, but it completely gets rid of the memory usage problem.
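As a minimal sketch of that workaround (assuming an SDV version whose sdv.tabular.CTGAN accepts the field_transformers argument; the column name 'user_id' is hypothetical and stands in for your high-cardinality columns):

```python
from sdv.tabular import CTGAN

# Map each high-cardinality column to the 'categorical' transformer instead of
# the default 'one_hot_encoding'. 'user_id' is a placeholder column name.
model = CTGAN(field_transformers={'user_id': 'categorical'})
# model.fit(data) and model.sample(...) are then used exactly as before.
```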
Hi @AnupamaGangadhar, we have solved issue sdv-dev/RDT#156 on RDT, and this problem has been fixed, as you can see in my screenshot: the process keeps roughly constant RAM usage while fitting and sampling, with only a small increase during fitting that later decreases, and no increase at all during sampling.
The fix will be included in the next release. In the meantime, you can install the RDT release candidate to try it out.
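The exact release candidate version is not given in this thread; assuming it is published on PyPI as a pre-release, something like the following lets you install it and confirm which version is in use:

```python
# Assumption: the RC is available on PyPI as a pre-release, so it can be
# installed with:  pip install --pre rdt
import rdt

print(rdt.__version__)  # confirm the pre-release version is the one in use
```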
If the issue persists, please feel free to reopen it.
Environment details
Problem description
The trained model is unable to generate synthetic data beyond a certain sample size
What I already tried
Able to train the model
Unable to generate synthetic data: the python process is killed before completion
Sampling fails at 5000 records
A memory profile of a successful run with 2000 records is given below
Data used for training
JSON of the format below - 500 records
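For reference, a minimal sketch of the calls involved (assuming the sdv.tabular.CTGAN API; 'training_data.json' is a hypothetical path standing in for the 500-record JSON described above):

```python
import pandas as pd
from sdv.tabular import CTGAN

# Hypothetical path; adjust to wherever the 500-record JSON lives.
data = pd.read_json('training_data.json')

model = CTGAN()
model.fit(data)           # training completes without problems

ok = model.sample(2000)   # succeeds; memory profile captured for this run
bad = model.sample(5000)  # the python process is killed before this finishes
```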
I am able to generate the synthetic data using the CTGAN model. The memory usage is given below.
One of the papers I read about CTGAN says
A Gaussian copula with appropriate margins generates the features, and the different parts of the development process are modeled with successive neural nets. The simulation machine accommodates only a few covariates; the generation of a large number of features with the Gaussian copula could lead to unrealistic combinations of factor levels.
Could this be the reason for the behaviour seen?