TypeError while ctgan.fit() #326

Closed
AT9991 opened this issue Jan 5, 2024 · 6 comments
Labels
bug Something isn't working resolution:duplicate This issue or pull request already exists

Comments

@AT9991
AT9991 commented Jan 5, 2024

Environment Details

Google Colab

Error Description

TypeError Traceback (most recent call last)
in <cell line: 1>()
----> 1 ctgan.fit(trial)

6 frames
/usr/local/lib/python3.10/dist-packages/rdt/transformers/base.py in _set_seed(self, data)
365 hash_value = self.columns[0]
366 for value in data.head(5):
--> 367 hash_value += str(value)
368
369 hash_value = int(hashlib.sha256(hash_value.encode('utf-8')).hexdigest(), 16)

TypeError: unsupported operand type(s) for +=: 'int' and 'str'
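The traceback suggests the seed is built by starting from the first column name and appending stringified values to it, which can only work when that name is already a string. A minimal sketch of that mechanism, assuming the three lines shown in the traceback are the relevant ones (`seed_from_column` and `fails_with_type_error` are hypothetical names for illustration, not part of RDT):

```python
import hashlib


def seed_from_column(column_name, values):
    """Hypothetical stand-in that mirrors the lines shown in the traceback."""
    hash_value = column_name
    for value in values[:5]:
        # Raises TypeError when column_name is an int, because int += str is undefined.
        hash_value += str(value)
    return int(hashlib.sha256(hash_value.encode('utf-8')).hexdigest(), 16)


def fails_with_type_error(column_name):
    """Return True if building the seed raises TypeError for this column name."""
    try:
        seed_from_column(column_name, [10, 20, 30])
    except TypeError:
        return True
    return False


print(fails_with_type_error(1))    # integer column name -> True
print(fails_with_type_error('1'))  # string column name -> False
```

If this reading is right, the data values themselves are not the problem; only the type of the column label matters.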

Steps to reproduce

!pip install ctgan
import pandas as pd
from ctgan import CTGAN
data = pd.read_csv(...)
ctgan = CTGAN(epochs=100)
ctgan.fit(data)

@AT9991 AT9991 added bug Something isn't working new Label applied to new issues labels Jan 5, 2024
@aarishmaqsood
I was facing the same problem. The cause may be your column names: they need to be strings, not integers.
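One way to check for this before fitting is to list any column labels that are not strings. This is a small sketch, not something CTGAN or SDV provides; `non_string_labels` is a hypothetical helper name:

```python
def non_string_labels(columns):
    """Return the column labels that are not of type str, in their original order."""
    return [label for label in columns if not isinstance(label, str)]


# Example with a mix of integer and string labels.
labels = [1, 2, 'age', 3.5]
print(non_string_labels(labels))  # [1, 2, 3.5]
```

With a pandas DataFrame, calling `non_string_labels(df.columns)` should work the same way, since iterating over `df.columns` yields the labels one by one. An empty result means the column names are not the cause.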

@npatki
Contributor

npatki commented Apr 16, 2024

Hi @AT9991 and @aarishmaqsood, would either of you be able to share some CSV data that we can use to replicate this?

BTW instead of using the CTGAN library directly, I would highly recommend using the SDV library. You can access the CTGAN Synthesizer via SDV. Doing so will allow you to make use of additional features -- such as better data pre-processing, customizations such as constraints, and conditional sampling.

I actually wonder whether you would still encounter this bug in SDV, since there is a lot more data validation and checking we do there. Here is a tutorial that uses CTGAN via the SDV library.

@npatki npatki added under discussion Issue is currently being discussed and removed new Label applied to new issues labels Apr 16, 2024
@aarishmaqsood
@npatki Thank you for your response. I have fixed my problem. In the future I will use your suggested solution.

@npatki
Contributor

npatki commented Apr 17, 2024

Great to hear, @aarishmaqsood. Could you describe what fixed your problem? If others run into the same issue, I can refer them here. Thanks.

@aarishmaqsood
aarishmaqsood commented Apr 18, 2024

@npatki Here is the Colab link, where I have replicated the error and provided the solution as well. This problem occurs in SDV version 1.5.0. Below are the code snippets that illustrate both the problem and the solution.

Reproducing the Error

!pip install sdv==1.5.0

import numpy as np
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Generate sample data
num_rows = 100
num_cols = 20
data = {i+1: np.random.randint(0, 100, size=num_rows) for i in range(num_cols)}
df = pd.DataFrame(data)

# create metadata from the DataFrame
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)

# Initialize the synthesizer (this is where the error occurs)
synthesizer = CTGANSynthesizer(metadata=metadata)

Solution

# Convert column names to strings
df.columns = ['col_' + str(i) for i in range(1, len(df.columns) + 1)]

# Re-create metadata for the table with corrected column names
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)

# Initialize the synthesizer with corrected metadata
synthesizer = CTGANSynthesizer(metadata=metadata)
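The renaming above replaces the labels by position. When the original labels should stay recognizable, an alternative is to convert each one to its string form and confirm that no two labels collapse into the same string (for example `1` and `'1'`). A sketch under that assumption; `stringify_labels` is a hypothetical helper, not an SDV or pandas function:

```python
def stringify_labels(columns):
    """Convert every column label to str, refusing to create duplicate names."""
    new_labels = [str(label) for label in columns]
    if len(set(new_labels)) != len(new_labels):
        raise ValueError('Converting labels to strings would create duplicate column names')
    return new_labels


print(stringify_labels([1, 2, 3]))  # ['1', '2', '3']
```

On a DataFrame this would be applied as `df.columns = stringify_labels(df.columns)`, followed by re-creating the metadata as in the snippet above. pandas also offers `df.columns.astype(str)` for the plain conversion, without the duplicate check.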

@npatki
Contributor

npatki commented Apr 18, 2024

Hi @aarishmaqsood, very much appreciate the detailed code and notebook.

Note that I have replicated this issue on the latest SDV (1.12.0) also. Here are a few things I discovered:

  1. The metadata auto-detection no longer works on SDV 1.12.0. I have filed an issue for it at SDV #1933.
  2. The fit problem isn't isolated to CTGAN. None of the SDV synthesizers work with this type of data, and all produce the same error. I have filed a generic issue at SDV #1935.

Since we now have the above two issues filed in our main SDV library, I will mark this one as a duplicate.

In the meantime, for anyone else running into the issue, I suggest using @aarishmaqsood's simple workaround that converts the column names from integers to strings.

Thanks all for helping uncover this. For any related discussion, please feel free to comment on either of the SDV issues linked above.

@npatki npatki closed this as completed Apr 18, 2024
@npatki npatki added resolution:duplicate This issue or pull request already exists and removed under discussion Issue is currently being discussed labels Apr 18, 2024
3 participants