InvalidDataError: The provided data does not match the metadata #1570

Closed
wzker11 opened this issue Sep 6, 2023 · 12 comments
Labels
bug, resolution:duplicate

Comments


wzker11 commented Sep 6, 2023

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.3.0
  • Python version: 3.11.3
  • Operating System: Windows

Error Description

I added a constraint requiring end (datetime) to be later than start (datetime), and then called synthesizer.fit(df). It raises the error: "InvalidDataError: The provided data does not match the metadata:

ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''"
But I have removed the rows where 'end' or 'start' is NULL.

Steps to reproduce

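from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# df is the trips DataFrame with 'start' and 'end' datetime columns
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)

# Constraint: 'start' must not be later than 'end'
start_lessthan_end = {
    'constraint_class': 'Inequality',
    'constraint_parameters': {
        'low_column_name': 'start',
        'high_column_name': 'end'
    }
}

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.add_constraints([start_lessthan_end])
synthesizer.fit(df)  # raises the InvalidDataError shown below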

---------------------------------------------------------------------------
InvalidDataError                          Traceback (most recent call last)
Cell In[82], line 1
----> 1 synthesizer.fit(df)

File c:\Users\wz\Anaconda3\Lib\site-packages\sdv\single_table\base.py:488, in BaseSynthesizer.fit(self, data)
    486 self._data_processor.reset_sampling()
    487 self._random_state_set = False
--> 488 processed_data = self._preprocess(data)
    489 self.fit_processed_data(processed_data)

File c:\Users\wz\Anaconda3\Lib\site-packages\sdv\single_table\base.py:432, in BaseSynthesizer._preprocess(self, data)
    431 def _preprocess(self, data):
--> 432     self.validate(data)
    433     self._data_processor.fit(data)
    434     return self._data_processor.transform(data)

File c:\Users\wz\Anaconda3\Lib\site-packages\sdv\single_table\base.py:260, in BaseSynthesizer.validate(self, data)
    257     errors += self._validate_column(data[column])
    259 if errors:
--> 260     raise InvalidDataError(errors)

InvalidDataError: The provided data does not match the metadata:

ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

wzker11 added the bug and new labels on Sep 6, 2023
Contributor

npatki commented Sep 8, 2023

Hi @wzker11, nice to meet you.

I'm curious what happens if you remove the constraint. Does the error go away?

The error message seems to indicate that the error is with the data itself, so I'm curious if it's the constraint that's causing it or something else.

Would you be able to share the metadata that you have? I'm especially curious about the columns involved in the constraint ('start' and 'end').

npatki added the under discussion label and removed the new label on Sep 8, 2023
Author

wzker11 commented Sep 11, 2023

Hi @npatki, thank you for your reply!
The error goes away if I remove the constraint. The 'start' and 'end' columns are the start and end times of trips, and both are datetime type.
It also works if I use 'create_custom_constraint_class' to create the constraint instead of 'Inequality'. However, another problem appears: when I generate synthetic samples with synthetic_data_custom_constraint = synthesizer.sample(10000), it stops after generating only 32 samples and shows no error. Could you please help me with that?
Screenshot 2023-09-11 105030

Author

wzker11 commented Sep 11, 2023

@npatki To add more information, the metadata is:
user_id : id
start : datetime
end : datetime
location_significance : categorical
latitude : numerical
longitude : numerical
address_country : categorical
address_city : categorical
address_city_type : categorical
address_streetname : categorical
location_place_name : categorical
location_place_type : categorical
duration : numerical
weekdays : numerical
hour_of_day_gmt_7 : numerical

Contributor

npatki commented Sep 11, 2023

Hi @wzker11, thanks for confirming. In this case, I would suggest sticking with the Inequality constraint since it is intended for exactly this logic. So let's debug why the Inequality constraint is failing.

Since your data is of the datetime sdtype, I'm curious how the columns are represented in memory (in the pandas DataFrame itself). Sometimes there are incompatibilities between how different data science libraries (numpy, pandas, etc.) represent datetime objects.

Could you provide the output from running the below command?

df.dtypes

Author

wzker11 commented Sep 12, 2023

Hi @npatki, thank you for your reply! This is the output for df.dtypes

user_id object
start datetime64[ns, UTC]
end datetime64[ns, UTC]
location_significance object
latitude float64
longitude float64
address_country object
address_city object
address_city_type object
address_streetname object
location_place_name object
location_place_type object
duration int64
weekdays float64
hour_of_day_gmt_7 int64
dtype: object

Contributor

npatki commented Sep 12, 2023

Hi @wzker11, no problem. Thank you for your quick responses as well.

Based on this information, we were able to replicate the issue.

It appears that both your datetime columns are represented as timezone aware (with respect to UTC). Timezone aware datetime objects are not currently supported in the SDV.

As a next step, we will file a feature request to support timezone aware columns (or at the very least, clean up the error message).
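
For anyone checking their own data, a minimal sketch of how timezone-aware columns could be spotted from the dtypes is below. The helper name timezone_aware_columns and the toy values are made up for illustration; only the dtypes mirror the df.dtypes output reported above.

```python
import pandas as pd


def timezone_aware_columns(df: pd.DataFrame) -> list[str]:
    """Return the names of columns whose dtype carries a timezone."""
    return [
        name
        for name, dtype in df.dtypes.items()
        if isinstance(dtype, pd.DatetimeTZDtype)
    ]


# Toy data (hypothetical values) with UTC-aware 'start' and 'end' columns.
df = pd.DataFrame({
    "start": pd.to_datetime(["2023-09-01 08:00", "2023-09-01 09:30"], utc=True),
    "end": pd.to_datetime(["2023-09-01 08:45", "2023-09-01 10:00"], utc=True),
    "duration": [45, 30],
})

print(timezone_aware_columns(df))  # ['start', 'end']
```

Any column listed here shows up in df.dtypes with a timezone suffix such as "datetime64[ns, UTC]".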

Workaround

I'm curious how you are loading your dataset, and whether the timezone is something you are explicitly adding?

A simple workaround would be to remove the timezone component for now, using the following command:

df['start'] = df['start'].dt.tz_localize(None)
df['end'] = df['end'].dt.tz_localize(None)

Since the entire column is represented with the same timezone, doing this should not have any significant effect on the quality of your synthetic data.
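
If there are several datetime columns, the same workaround can be applied in a loop. This is only a sketch: the helper name drop_timezones and the toy values are assumptions for illustration, and it relies on nothing beyond the tz_localize(None) call shown above.

```python
import pandas as pd


def drop_timezones(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every timezone-aware column made naive."""
    out = df.copy()
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, pd.DatetimeTZDtype):
            # tz_localize(None) drops the timezone and keeps the wall-clock time.
            out[name] = out[name].dt.tz_localize(None)
    return out


# Toy data (hypothetical values) with UTC-aware datetime columns.
df = pd.DataFrame({
    "start": pd.to_datetime(["2023-09-01 08:00", "2023-09-01 09:30"], utc=True),
    "end": pd.to_datetime(["2023-09-01 08:45", "2023-09-01 10:00"], utc=True),
})

naive = drop_timezones(df)
print(naive["start"].dt.tz)  # None
```

Note that tz_localize(None) keeps the local wall-clock values. If a column used a non-UTC timezone and you wanted UTC values instead, tz_convert(None) would be the call to use.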

Let us know if you have any questions!

Author

wzker11 commented Sep 12, 2023

Hi @npatki, understood. I will try to remove the timezone and see whether it will work this time.

And by the way, for this problem,
When I generate synthetic samples with synthetic_data_custom_constraint = synthesizer.sample(10000), it stops after generating only 32 samples and shows no error. Could you please help me with that?
Screenshot 2023-09-11 105030

It was solved when I added the parameters batch_size=1000000, max_tries_per_batch=100000. But may I know what batch_size and max_tries_per_batch I should set to get the whole output? Or do I need to try different values each time?

Looking forward to hearing from you! Thank you!

Contributor

npatki commented Sep 20, 2023

Hi @wzker11, I just wanted to check to see if the Inequality works after you remove the timezone part of the datetime columns?

I noticed you filed new issues for the custom constraint problems, which is great because it's a slightly different topic.

To the extent possible, we recommend using pre-defined constraints since they are pre-vetted and tested by the SDV team. Testing your custom code (and how it interacts with the SDV modeling) is a bit tougher.

Author

wzker11 commented Sep 21, 2023

Hi @npatki, thank you for your reply! The Inequality constraint works after removing the timezone part of the datetime columns. I understand the pre-defined constraints may be more stable, but my original dataset needs more complex constraints to make the synthetic data more meaningful.

Contributor

npatki commented Sep 21, 2023

Great, thanks for confirming the timezone part was the culprit. I have filed #1576 to track timezone-aware datetime columns specifically, so I will close this issue in favor of that.

As for the custom constraints -- we will follow up on #1591

npatki closed this as completed on Sep 21, 2023
npatki added the resolution:duplicate label and removed the under discussion label on Sep 21, 2023

vijayashree-kr commented Dec 13, 2023

Hello. I'm encountering the same issue, even when the datetimes are timezone unaware. Please suggest a possible workaround.
Dtypes of the data:

number int64
date datetime64[ns]
checkin_utc datetime64[ns]
checkin_local datetime64[ns]
checkout_utc datetime64[ns]
checkout_local datetime64[ns]

Contributor

npatki commented Dec 14, 2023

Hi @vijayashree-kr, could you please file a new issue with what you're observing? This issue is only for timezone-aware columns and has already been fixed. Thanks.

FYI In order to replicate your problem, it will also be helpful if you could include more information in the new issue that you file:

  • Your metadata dictionary
  • The code you are using to create a synthesizer, add constraints etc.
  • The full stack trace (everything printed out when you see the Error)
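
One possible way to gather this information in a single step is sketched below. The helper describe_failure and the stand-in failing_fit function are hypothetical; in practice you would pass your own call (for example synthesizer.fit) and also include the metadata dictionary (for example from metadata.to_dict()).

```python
import traceback

import pandas as pd


def describe_failure(df: pd.DataFrame, action) -> str:
    """Run action(df) and return a text report with dtypes and any stack trace."""
    report = ["dtypes:", df.dtypes.to_string()]
    try:
        action(df)
        report.append("no error raised")
    except Exception:
        # format_exc() returns the full stack trace of the current exception.
        report += ["stack trace:", traceback.format_exc()]
    return "\n".join(report)


def failing_fit(data):
    """Stand-in for the call that fails, e.g. synthesizer.fit(data)."""
    raise ValueError("example failure")


# Toy data (hypothetical values) using two of the column names reported above.
df = pd.DataFrame({
    "number": [1, 2],
    "checkin_utc": pd.to_datetime(["2023-12-01 10:00", "2023-12-02 11:00"]),
})

report = describe_failure(df, failing_fit)
print(report)
```

The printed report can be pasted directly into the new issue.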
