InvalidDataError: The provided data does not match the metadata #1570
Hi @wzker11, nice to meet you. I'm curious what happens if you remove the constraint. Does the error go away? The error message seems to indicate that the problem is with the data itself, so I'm curious whether it's the constraint that's causing it or something else. Would you be able to share the metadata that you have? I'm especially curious about the columns involved in the constraint.
Hi @npatki, thank you for your reply!
@npatki To add more information, here is the metadata:
Hi @wzker11, thanks for confirming. In this case, I would suggest sticking with the pre-defined Inequality constraint. Since your data involves datetime columns, could you provide the output from running `df.dtypes`?
Hi @npatki, thank you for your reply! This is the output for `df.dtypes`:

```
user_id    object
```
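For anyone hitting this thread later: a quick way to check which datetime columns are timezone aware is to inspect each column's dtype. This is a minimal pandas sketch (the DataFrame contents here are made up for illustration, not the reporter's actual data):

```python
import pandas as pd

# Toy data mimicking the reported schema: tz-aware 'start' and 'end' columns
df = pd.DataFrame({
    "user_id": ["a", "b"],
    "start": pd.to_datetime(["2023-01-01 08:00", "2023-01-02 09:30"]).tz_localize("UTC"),
    "end": pd.to_datetime(["2023-01-01 10:00", "2023-01-02 11:00"]).tz_localize("UTC"),
})

# A tz-aware column has dtype 'datetime64[ns, UTC]'; a naive one is 'datetime64[ns]'
tz_aware = [
    col for col in df.columns
    if isinstance(df[col].dtype, pd.DatetimeTZDtype)
]
print(tz_aware)  # ['start', 'end']
```

If the list is non-empty, the data falls into the unsupported case described below.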
Hi @wzker11, no problem. Thank you for your quick responses as well. Based on this information, we were able to replicate the issue. It appears that both of your datetime columns are represented as timezone aware (with respect to UTC). Timezone-aware datetime objects are not currently supported in the SDV. As a next step, we will file a feature request to support timezone-aware columns (or at the very least, clean up the error message).

Workaround

I'm curious how you are loading your dataset, and whether the timezone is something you are explicitly adding. A simple workaround would be to remove the timezone component for now, using the following commands:

```python
df['start'] = df['start'].dt.tz_localize(None)
df['end'] = df['end'].dt.tz_localize(None)
```

Since the entire column is represented with the same timezone, doing this should not have any significant effect on the quality of your synthetic data. Let us know if you have any questions!
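To illustrate what `tz_localize(None)` does: it drops the timezone from the dtype while keeping the wall-clock values intact. A small self-contained sketch (toy values, not the reporter's data):

```python
import pandas as pd

# Build a tz-aware series, as in the reported dataset
s = pd.Series(pd.to_datetime(["2023-01-01 08:00", "2023-01-02 09:30"])).dt.tz_localize("UTC")
print(s.dtype)  # datetime64[ns, UTC]

# Removing the timezone yields a naive column; the wall-clock time is unchanged
naive = s.dt.tz_localize(None)
print(naive.dtype)     # datetime64[ns]
print(naive.iloc[0])   # 2023-01-01 08:00:00
```

Because every value in the column shares the same timezone, no relative ordering between rows changes, which is why the workaround is safe for an Inequality constraint.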
Hi @npatki, understood. I will try removing the timezone and see whether it works this time. By the way, this problem was solved when I added the parameters batch_size=1000000, max_tries_per_batch=100000. But may I know what batch_size and max_tries_per_batch I should set to get the whole output? Or do I need to try different values each time? Looking forward to hearing from you! Thank you!
Hi @wzker11, I just wanted to check whether the Inequality constraint is working for you now. I noticed you filed new issues for the custom constraints problems, which is great because it's a slightly different topic. To the extent possible, we recommend using pre-defined constraints since they are pre-vetted and tested by the SDV team. Testing your custom code (and how it interacts with the SDV modeling) is a bit tougher.
Hi @npatki, thank you for your reply! The Inequality constraint works after removing the timezone part of the datetime columns. I understand the pre-defined constraints may be more stable, but my original dataset needs more complex constraints to make the synthetic data more meaningful.
Hello. I'm encountering the same issue, even when the datetime is timezone unaware. Please suggest a possible workaround:

```
number    int64
```
Hi @vijayashree-kr, could you please file a new issue with what you're observing? This issue is only for timezone-aware columns and has already been fixed. Thanks. FYI, in order to replicate your problem, it will also be helpful if you could include more information in the new issue that you file:
Environment Details
Please indicate the following details about the environment in which you found the bug:
Error Description
I added a constraint that end (datetime) must be larger than start (datetime), and then ran synthesizer.fit(df). It shows the error:

"InvalidDataError: The provided data does not match the metadata:
ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''"

But I have already removed the rows where 'end' or 'start' is NULL.
Steps to reproduce
```python
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer  # import was missing in the original snippet

# Auto-detect the metadata from the dataframe
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)

# Constraint: 'start' must be strictly less than 'end'
start_lessthan_end = {
    'constraint_class': 'Inequality',
    'constraint_parameters': {
        'low_column_name': 'start',
        'high_column_name': 'end'
    }
}

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.add_constraints([start_lessthan_end])
synthesizer.fit(df)
```
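Applying the workaround discussed in the thread, one way to make the repro succeed is to strip the timezone from every tz-aware datetime column before detecting metadata and fitting. The helper below (`strip_timezones` is a hypothetical name, not part of the SDV API) is a pandas-only sketch of that preprocessing step:

```python
import pandas as pd

def strip_timezones(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with timezone info removed from all tz-aware datetime columns."""
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.DatetimeTZDtype):
            # Drop the timezone but keep the wall-clock values
            out[col] = out[col].dt.tz_localize(None)
    return out

# Toy example mimicking the reported schema
df = pd.DataFrame({
    "start": pd.to_datetime(["2023-01-01 08:00"]).tz_localize("UTC"),
    "end": pd.to_datetime(["2023-01-01 10:00"]).tz_localize("UTC"),
})
clean = strip_timezones(df)
print(clean.dtypes)  # both columns are now naive datetime64[ns]
```

The cleaned dataframe would then be passed to `metadata.detect_from_dataframe(...)` and `synthesizer.fit(...)` in place of the original `df`.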