Communicate the training progress during fit #579

Closed
EduardoPassaro opened this issue Sep 8, 2021 · 4 comments
Labels
  • feature:performance (Related to time or memory usage)
  • feature request (Request for a new feature)
  • resolution:duplicate (This issue or pull request already exists)

Comments


EduardoPassaro commented Sep 8, 2021

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

  • SDV version: 0.12.0
  • Python version: 3.8
  • Operating System: Windows Server 2016
  • RAM: 64GB
  • CPU: 3.3GHz

Problem description

I executed sdv.fit(meta, tables), but the run has been in progress for over 12 hours. Is that expected? Is there any simplification I could make?

Relational model:

  • 10 tables
  • 100 records
  • 90 columns distributed across the tables
  • Mix of categorical, date (%Y-%m-%d) and float datatypes

import pandas as pd

# Raw strings keep the backslashes in the Windows paths literal.
AA = pd.read_csv(r'C:\AA.csv')
BB = pd.read_csv(r'C:\BB.csv')
CC = pd.read_csv(r'C:\CC.csv')
DD = pd.read_csv(r'C:\DD.csv')
EE = pd.read_csv(r'C:\EE.csv')
FF = pd.read_csv(r'C:\FF.csv')
GG = pd.read_csv(r'C:\GG.csv')
HH = pd.read_csv(r'C:\HH.csv')
II = pd.read_csv(r'C:\II.csv')
JJ = pd.read_csv(r'C:\JJ.csv')

# Parse the date columns (format %Y-%m-%d) into datetimes.
AA['aa_date'] = pd.to_datetime(AA['aa_date'], format='%Y-%m-%d')
BB['bb_date'] = pd.to_datetime(BB['bb_date'], format='%Y-%m-%d')

from sdv import Metadata

meta = Metadata()
meta.add_table(
   name='AA',
   data=AA,
   primary_key='AA_id'
   )
meta.add_table(
   name='BB',
   data=BB,
   primary_key='BB_id',
   parent='AA',
   foreign_key='AA_id'
   )
meta.add_table(
   name='CC',
   data=CC,
   primary_key='BB_id',
   parent='AA',
   foreign_key='AA_id'
   )
meta.add_table(
   name='DD',
   data=DD,
   primary_key='DD_id'
   )
meta.add_table(
   name='EE',
   data=EE,
   primary_key='EE_id'
   )
meta.add_table(
   name='FF',
   data=FF,
   primary_key='FF_id'
   )
meta.add_table(
   name='GG',
   data=GG,
   primary_key='GG_id',
   )
meta.add_table(
   name='HH',
   data=HH,
   primary_key='AA_id',
   parent='AA'
   )   
meta.add_table(
   name='II',
   data=II,
   primary_key='AA_id',
   parent='AA'
   )  
meta.add_table(
   name='JJ',
   data=JJ,
   primary_key='JJ_id'
   )
meta.add_relationship(parent='DD', child='AA', foreign_key='DD_id', validate=True)
meta.add_relationship(parent='EE', child='DD', foreign_key='EE_id', validate=True)
meta.add_relationship(parent='FF', child='DD', foreign_key='FF_id', validate=True)
meta.add_relationship(parent='JJ', child='DD', foreign_key='JJ_id', validate=True)
meta.add_relationship(parent='GG', child='AA', foreign_key='GG_id', validate=True)

tables = {
   'AA': AA,
   'BB': BB,
   'CC': CC,  
   'DD': DD,
   'EE': EE,
   'FF': FF,
   'GG': GG,
   'HH': HH,
   'II': II,
   'JJ': JJ
   }
   
from sdv import SDV
sdv = SDV()
sdv.fit(meta, tables)

There is no output. The model has been running for 12+ hours.
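
As a rough aid for reasoning about the schema above, the parent/child relationships declared in the `add_table` and `add_relationship` calls can be written down as a graph and inspected. The sketch below uses only the standard library and computes the longest parent-to-child chain. The edge list is transcribed by hand from the code above, and the `longest_chain` helper is hypothetical, not part of SDV. Whether chain depth explains the long runtime is an assumption, not a diagnosis.

```python
from collections import defaultdict

# Parent -> child edges transcribed from the add_table / add_relationship calls.
RELATIONSHIPS = [
    ("AA", "BB"), ("AA", "CC"), ("AA", "HH"), ("AA", "II"),
    ("DD", "AA"), ("GG", "AA"),
    ("EE", "DD"), ("FF", "DD"), ("JJ", "DD"),
]


def longest_chain(relationships):
    """Return the longest parent-to-child chain in an acyclic schema."""
    children = defaultdict(list)
    for parent, child in relationships:
        children[parent].append(child)

    cache = {}

    def chain_from(table):
        # Memoized depth-first walk: the best chain starting at `table`.
        if table not in cache:
            best = []
            for child in children.get(table, []):
                candidate = chain_from(child)
                if len(candidate) > len(best):
                    best = candidate
            cache[table] = [table] + best
        return cache[table]

    tables = {name for edge in relationships for name in edge}
    return max((chain_from(name) for name in sorted(tables)), key=len)


chain = longest_chain(RELATIONSHIPS)
print(len(chain), " -> ".join(chain))
```

For this schema the longest chain spans four tables, starting at one of the parents of DD and passing through DD and AA to one of AA's children.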

EduardoPassaro added the "pending review" and "question" labels Sep 8, 2021
Contributor

katxiao commented Sep 20, 2021

Hi @EduardoPassaro, thanks for raising this issue! This definitely highlights the need for some notion of progress to be communicated during the fit process.
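
To make "some notion of progress" concrete, here is a minimal sketch of per-table progress reporting built on the standard `logging` and `time` modules. It is not SDV's API: `fit_with_progress` and the `fit_one_table` callable are hypothetical names, and the sketch assumes the fitting work can be split into one call per table, which may not match how the library fits a relational model.

```python
import logging
import time

logger = logging.getLogger("fit_progress")


def fit_with_progress(table_names, fit_one_table, report=logger.info):
    """Call fit_one_table(name) for each table and report after each one.

    fit_one_table is a hypothetical per-table callable, not an SDV function.
    Returns a dict mapping each table name to its elapsed time in seconds.
    """
    names = list(table_names)
    timings = {}
    for index, name in enumerate(names, start=1):
        started = time.perf_counter()
        fit_one_table(name)
        timings[name] = time.perf_counter() - started
        report("[%d/%d] fitted %s in %.2fs" % (index, len(names), name, timings[name]))
    return timings


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    # Stand-in workload: sleep briefly instead of training a model.
    fit_with_progress(["AA", "BB", "CC"], lambda name: time.sleep(0.01))
```

Passing a different `report` callable (for example `print`, or a list's `append`) redirects the messages without changing the loop.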

In order to help us understand your use case, could you share some sample data of the tables you are trying to model? It would be helpful to see the column types in each table and examples of the column data.
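
One way to share column types without sharing raw data is a per-column summary. The sketch below is standard-library only and classifies string values as float, date (%Y-%m-%d) or categorical, the three kinds mentioned in the report. The helper names and the sample rows are made up for illustration; rows in this shape could come from `csv.DictReader`.

```python
from datetime import datetime


def infer_kind(values):
    """Classify a column's string values as 'float', 'date' or 'categorical'."""
    present = [value for value in values if value not in ("", None)]
    if not present:
        return "empty"

    def all_parse(parser):
        # True only if every non-missing value is accepted by the parser.
        try:
            for value in present:
                parser(value)
        except (TypeError, ValueError):
            return False
        return True

    if all_parse(float):
        return "float"
    if all_parse(lambda value: datetime.strptime(value, "%Y-%m-%d")):
        return "date"
    return "categorical"


def summarize(rows):
    """Map each column name to (inferred kind, number of distinct values)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: (infer_kind(vals), len(set(vals))) for name, vals in columns.items()}


# Made-up rows for illustration only; not the reporter's data.
sample = [
    {"AA_id": "1", "aa_date": "2021-09-08", "status": "open"},
    {"AA_id": "2", "aa_date": "2021-09-20", "status": "closed"},
]
summary = summarize(sample)
print(summary)
```

Note that numeric identifiers such as `AA_id` are reported as float by this simple rule, so key columns need to be read with that in mind.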

Additionally, if you're interested in another way of communicating with the SDV community, you could check out our Slack workspace!

katxiao added the "under discussion" label and removed the "pending review" label Sep 20, 2021
Contributor

npatki commented Jun 1, 2022

Hi! Since this issue is stale now, I'm removing the "under discussion" label and repurposing it to a feature request: Communicate the training progress during fit

npatki added the "feature request" label and removed the "question" and "under discussion" labels Jun 1, 2022
npatki changed the title from "sdv.fit performance" to "Communicate the training progress during fit" Jun 1, 2022
Contributor

npatki commented Jul 22, 2022

FYI some related issues:

npatki added the "feature:performance" label Jul 22, 2022
Contributor

npatki commented Jun 1, 2023

Good news! The team is actively working on this feature and hopes to include it in a future SDV release soon.

For more details see #1440. I'm closing this issue as a dupe of #1440.

npatki closed this as completed Jun 1, 2023
npatki added the "resolution:duplicate" label Jun 1, 2023