Memory Usage - Gaussian Copula Trained Model consuming high memory when generating synthetic data #304

AnupamaGangadhar · 2021-01-27T08:43:44Z

Environment details

SDV version: sdv==0.4.5
Python version: Python 3.7.4
Operating System: Catalina 10.15.7 (19H2)

Problem description

The trained model is unable to generate synthetic data for certain sample size

What I already tried

Able to train the model
Unable to generate synthetic data - python process is killed before completion

from sdv import SDV
from sdv import Metadata
from sdv.tabular import GaussianCopula
import json
import os
import logging
import pandas as pd
import numpy as np

@profile
def my_func():
	model_file = "./100.pkl"
	sample_size = 5000
	model = GaussianCopula.load(model_file)
	syndata = model.sample(sample_size)
	#syndata.to_csv(syn_csv_file, header=True,index=False)

if __name__ == '__main__':
	my_func()

$python -m memory_profiler testdatagen.py

Fails at 5000 records
Successful run mem profile for 2000 records is given below

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    10  115.004 MiB  115.004 MiB           1   @profile
    11                                         def my_func():
    12  115.004 MiB    0.000 MiB           1   	model_file = "/tmp/100.pkl"
    13  115.004 MiB    0.000 MiB           1   	sample_size = 2000
    14  115.004 MiB    0.000 MiB           1   	syn_csv_file = "/tmp/100.csv"
    15  403.266 MiB  288.262 MiB           1   	model = GaussianCopula.load(model_file)
    16  138.746 MiB -264.520 MiB           1   	syndata = model.sample(sample_size)

data used for training
json of below format - 500 records

        {
          "row": {
            "ID": "491",
            "Card Type Code": "JC",
            "Card Type Full Name": "xxx",
            "Issuing Bank": "xxx",
            "Card Number": "354268752674xxxx",
            "Card Holder's Name": "yyy yyyyy",
            "Issue Date": "xx/xxxx",
            "Expiry Date": "xx/xxxx",
            "Billing Date": "20",
            "Card PIN": "xxxx",
            "Credit Limit": "102300",
            "Age": "xx",
            "SSN": "xxx-xx-xxxx",
            "JobTitle": "Software Engineer",
            "Additional Details": "xxxx 1510 7973 1294 45 [email protected] 509-xxx-8996 sixxxlemachines.xxx Wolf:7009751759485477"
          }
        }

I am able to generate the synthetic data using CTGAN model. Given below is the memory usage

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    11  191.926 MiB  191.926 MiB           1   @profile
    12                                         def my_func():
    13  191.926 MiB    0.000 MiB           1   model_file = "/tmp/102.pkl"
    14  191.926 MiB    0.000 MiB           1   sample_size = 10000
    15  191.926 MiB    0.000 MiB           1   syn_csv_file = "/tmp/102.csv"
    16 1123.863 MiB  931.938 MiB           1   model = CTGAN.load(model_file)
    17  678.570 MiB -445.293 MiB           1   syndata = model.sample(sample_size)
    18  679.195 MiB    0.625 MiB           1   syndata.to_csv(syn_csv_file, header=True,index=False)

One of the papers I read about CTGAN says
A Gaussian copula with appropriate margins generates the features, and the different parts of the development process are modeled with successive neural nets. The simulation machine accommodates only a few covariates; the generation of a large number of features with the Gaussian copula could lead to unrealistic combinations of factor levels.

Could this be the reason for the behaviour seen?

The text was updated successfully, but these errors were encountered:

pvk-developer · 2021-03-12T10:40:40Z

Hi @AnupamaGangadhar , thank you for reporting this issue!

I have been able to reproduce the problem when the dataset has a large number of unique categorical values like yours. Here is a screen capture of the memory usage while sampling in such scenario, which is the same that you are describing.

We are working on a fix for this on RDT to reduce the memory ussage (sdv-dev/RDT#156), but in the meantime I recommend you to change the categorical transformer to categorical instead of the default one_hot_encoding.

This may slightly reduce how well the model learns the correlations with some of the categorical columns, but completely get rid of the memory usage problem.

model = GaussianCopula(categorical_transformer='categorical')
model.fit(data)
model.sample()

pvk-developer · 2021-03-16T18:21:20Z

Hi @AnupamaGangadhar , we have solved the issue sdv-dev/RDT#156 on RDT, and this problem has been fixed as you can observe on my screenshot (the process kept the same memory ram usage while fitting and sampling, we can observe only a small increase while fitting which then decreases and no increase when sampling).

This issue will be solved with the next release. Meanwhile, you can install the RDT's release candidate to try it out.

If the issue persists, please feel free to reopen it.

AnupamaGangadhar added pending review question General question about the software labels Jan 27, 2021

pvk-developer mentioned this issue Mar 11, 2021

HyperTransformer: Memory usage increase when reverse_transform is called sdv-dev/RDT#156

Closed

csala removed the pending review label Mar 16, 2021

pvk-developer added this to the 0.9.0 milestone Mar 16, 2021

pvk-developer closed this as completed Mar 16, 2021

csala added bug Something isn't working and removed question General question about the software labels Mar 31, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory Usage - Gaussian Copula Trained Model consuming high memory when generating synthetic data #304

Memory Usage - Gaussian Copula Trained Model consuming high memory when generating synthetic data #304

AnupamaGangadhar commented Jan 27, 2021 •

edited by csala

Loading

pvk-developer commented Mar 12, 2021

pvk-developer commented Mar 16, 2021

Memory Usage - Gaussian Copula Trained Model consuming high memory when generating synthetic data #304

Memory Usage - Gaussian Copula Trained Model consuming high memory when generating synthetic data #304

Comments

AnupamaGangadhar commented Jan 27, 2021 • edited by csala Loading

Environment details

Problem description

What I already tried

pvk-developer commented Mar 12, 2021

pvk-developer commented Mar 16, 2021

AnupamaGangadhar commented Jan 27, 2021 •

edited by csala

Loading