InvalidDataError: The provided data does not match the metadata #1570

wzker11 · 2023-09-06T07:38:02Z

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDV version: 1.3.0
Python version: 3.11.3
Operating System: Windows

Error Description

I added a constraint requiring that end (datetime) is greater than start (datetime), and then called synthesizer.fit(df). It raises the error: "InvalidDataError: The provided data does not match the metadata:

ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''"
However, I have already removed the rows where 'end' or 'start' is NULL.

Steps to reproduce

from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Detect the column sdtypes from the DataFrame
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)

# Constraint: 'start' must not be later than 'end'
start_lessthan_end = {
    'constraint_class': 'Inequality',
    'constraint_parameters': {
        'low_column_name': 'start',
        'high_column_name': 'end'
    }
}

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.add_constraints([start_lessthan_end])
synthesizer.fit(df)

---------------------------------------------------------------------------
InvalidDataError                          Traceback (most recent call last)
Cell In[82], line 1
----> 1 synthesizer.fit(df)

File c:\Users\wz\Anaconda3\Lib\site-packages\sdv\single_table\base.py:488, in BaseSynthesizer.fit(self, data)
    486 self._data_processor.reset_sampling()
    487 self._random_state_set = False
--> 488 processed_data = self._preprocess(data)
    489 self.fit_processed_data(processed_data)

File c:\Users\wz\Anaconda3\Lib\site-packages\sdv\single_table\base.py:432, in BaseSynthesizer._preprocess(self, data)
    431 def _preprocess(self, data):
--> 432     self.validate(data)
    433     self._data_processor.fit(data)
    434     return self._data_processor.transform(data)

File c:\Users\wz\Anaconda3\Lib\site-packages\sdv\single_table\base.py:260, in BaseSynthesizer.validate(self, data)
    257     errors += self._validate_column(data[column])
    259 if errors:
--> 260     raise InvalidDataError(errors)

InvalidDataError: The provided data does not match the metadata:

ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''


npatki · 2023-09-08T19:16:45Z

Hi @wzker11, nice to meet you.

I'm curious what happens if you remove the constraint. Does the error go away?

The error message seems to indicate that the error is with the data itself, so I'm curious if it's the constraint that's causing it or something else.

Would you be able to share the metadata that you have? I'm especially curious about the columns involved in the constraint ('start' and 'end').

wzker11 · 2023-09-11T02:54:39Z

Hi @npatki, thank you for your reply!
The error goes away if I remove the constraint. The 'start' and 'end' columns are the start and end times of trips, and both are of datetime type.
It also works if I use 'create_custom_constraint_class' to create the constraint instead of 'Inequality'. However, another problem appears: when I generate synthetic samples using synthetic_data_custom_constraint = synthesizer.sample(10000), it stops after generating only 32 samples, and no error is shown. Could you please help me with that?

wzker11 · 2023-09-11T04:57:04Z

@npatki To add more information, the metadata is
user_id : id
start : datetime
end : datetime
location_significance : categorical
latitude : numerical
longitude : numerical
address_country : categorical
address_city : categorical
address_city_type : categorical
address_streetname : categorical
location_place_name : categorical
location_place_type : categorical
duration : numerical
weekdays : numerical
hour_of_day_gmt_7 : numerical

npatki · 2023-09-11T15:50:10Z

Hi @wzker11, thanks for confirming. In this case, I would suggest sticking with the Inequality constraint, since it is intended for exactly this logic. So let's debug why the Inequality constraint is failing.

Since your data is of datetime sdtype, I'm curious about how the columns are represented in memory (in the pandas DataFrame itself). Sometimes there are incompatibilities in how different data science libraries (numpy, pandas, etc.) represent datetime objects.

Could you provide the output from running the below command?

df.dtypes
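
If the full dtypes listing is long, the timezone-aware columns can also be picked out directly. The following is a minimal sketch on a made-up two-row frame (`df_example` and its values are hypothetical, not the reporter's data); it relies on pandas' `select_dtypes` accepting the `'datetimetz'` selector.

```python
import pandas as pd

# Hypothetical stand-in for the real data: 'start' is timezone-aware,
# 'end' is timezone-naive.
df_example = pd.DataFrame({
    "start": pd.to_datetime(["2023-09-01 08:00", "2023-09-02 09:30"], utc=True),
    "end": pd.to_datetime(["2023-09-01 08:45", "2023-09-02 10:10"]),
})

# 'datetimetz' selects only columns whose dtype carries a timezone,
# such as datetime64[ns, UTC].
tz_aware_columns = df_example.select_dtypes(include=["datetimetz"]).columns.tolist()
print(tz_aware_columns)
```

Any column name printed here has a dtype of the form datetime64[ns, <tz>] in the df.dtypes output.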

wzker11 · 2023-09-12T01:07:36Z

Hi @npatki, thank you for your reply! This is the output for df.dtypes

user_id object
start datetime64[ns, UTC]
end datetime64[ns, UTC]
location_significance object
latitude float64
longitude float64
address_country object
address_city object
address_city_type object
address_streetname object
location_place_name object
location_place_type object
duration int64
weekdays float64
hour_of_day_gmt_7 int64
dtype: object

npatki · 2023-09-12T02:58:53Z

Hi @wzker11, no problem. Thank you for your quick responses as well.

Based on this information, we were able to replicate the issue.

It appears that both of your datetime columns are represented as timezone-aware (with respect to UTC). Timezone-aware datetime objects are not currently supported in the SDV.

As a next step, we will file a feature request to support timezone-aware columns (or at the very least, clean up the error message).
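
For intuition about why the message mentions 'isnan', here is a small sketch that reproduces the same kind of failure with pandas and numpy alone. It is an illustration under assumptions, not a trace of the SDV's actual code path: it only shows that a timezone-aware column converts to an object array of Timestamps, and that `np.isnan` rejects object arrays. The variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical timezone-aware column and its timezone-naive counterpart.
aware = pd.Series(pd.to_datetime(["2023-09-01 08:00", "2023-09-02 09:30"], utc=True))
naive = aware.dt.tz_localize(None)

# A tz-aware Series becomes an object array of Timestamp objects,
# while a tz-naive Series stays a native datetime64 array.
aware_values = aware.to_numpy()
naive_values = naive.to_numpy()
print(aware_values.dtype, naive_values.dtype)

# np.isnan has no implementation for object arrays and raises TypeError.
try:
    np.isnan(aware_values)
    isnan_error = None
except TypeError as exc:
    isnan_error = str(exc)
print(isnan_error)
```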

Workaround

I'm curious how you are loading your dataset, and whether the timezone is something you are explicitly adding?

A simple workaround would be to remove the timezone component for now, using the following command:

df['start'] = df['start'].dt.tz_localize(None)
df['end'] = df['end'].dt.tz_localize(None)

Since the entire column is represented with the same timezone, doing this should not have any significant effect on the quality of your synthetic data.
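
The same workaround can be applied to every timezone-aware column at once. Below is a minimal sketch; `drop_timezones` and the `trips` frame are hypothetical names introduced for illustration, and the helper assumes each column uses a single timezone (as is the case here with UTC).

```python
import pandas as pd


def drop_timezones(df):
    """Return a copy of df with the timezone removed from tz-aware columns."""
    out = df.copy()
    for name, dtype in out.dtypes.items():
        if isinstance(dtype, pd.DatetimeTZDtype):
            # tz_localize(None) keeps the wall-clock values and drops the tz.
            out[name] = out[name].dt.tz_localize(None)
    return out


# Hypothetical trips data with UTC-aware start and end times.
trips = pd.DataFrame({
    "start": pd.to_datetime(["2023-09-01 08:00", "2023-09-02 09:30"], utc=True),
    "end": pd.to_datetime(["2023-09-01 08:45", "2023-09-02 10:10"], utc=True),
})

naive_trips = drop_timezones(trips)
print(naive_trips.dtypes)

# The ordering that the Inequality constraint relies on is unchanged.
print((naive_trips["start"] <= naive_trips["end"]).all())
```

Because the helper works on a copy, the original frame keeps its timezone information in case it is needed later.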

Let us know if you have any questions!

wzker11 · 2023-09-12T03:05:56Z

Hi @npatki, understood. I will try removing the timezone and see whether it works this time.

By the way, regarding this problem:
"When I generate synthetic samples using synthetic_data_custom_constraint = synthesizer.sample(10000), it stops after generating only 32 samples, and no error is shown. Could you please help me with that?"
[Screenshot 2023-09-11 105030]

It was solved when I added the parameters batch_size=1000000 and max_tries_per_batch=100000. But may I know what values of batch_size and max_tries_per_batch I should set to get the whole output? Or do I need to try different values each time?

Looking forward to hearing from you! Thank you!

npatki · 2023-09-20T20:03:54Z

Hi @wzker11, I just wanted to check whether the Inequality constraint works after you remove the timezone part of the datetime columns.

I noticed you filed new issues for the custom constraint problems, which is great because it's a slightly different topic.

To the extent possible, we recommend using pre-defined constraints since they are pre-vetted and tested by the SDV team. Testing your custom code (and how it interacts with the SDV modeling) is a bit tougher.
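
For intuition on the batch_size and max_tries_per_batch question raised earlier, the toy sketch below shows how a rejection-style sampling loop can return fewer rows than requested when few candidates satisfy a constraint. It is plain Python and not the SDV's implementation: `reject_sample`, its loop structure, and the 5% acceptance rule are assumptions made purely for illustration.

```python
import random


def reject_sample(num_rows, is_valid, batch_size, max_tries_per_batch, seed=0):
    """Toy rejection sampler: keep valid draws, give up when tries run out."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < num_rows:
        target = min(batch_size, num_rows - len(kept))
        batch = []
        for _ in range(max_tries_per_batch):
            # Draw a batch of candidates and keep only the valid ones.
            candidates = [rng.random() for _ in range(batch_size)]
            batch.extend(c for c in candidates if is_valid(c))
            if len(batch) >= target:
                break
        if len(batch) < target:
            # Out of tries for this batch: return what was collected so far.
            kept.extend(batch)
            break
        kept.extend(batch[:target])
    return kept


# Hypothetical constraint that only about 5% of candidates satisfy.
def rarely_valid(x):
    return x < 0.05


few_tries = reject_sample(100, rarely_valid, batch_size=100, max_tries_per_batch=1)
many_tries = reject_sample(100, rarely_valid, batch_size=100, max_tries_per_batch=1000)
print(len(few_tries), len(many_tries))
```

In this toy setting, the number of tries needed grows as the share of valid candidates shrinks, so the right values depend on how often the constraint is satisfied rather than on a fixed rule.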

wzker11 · 2023-09-21T02:59:51Z

Hi @npatki, thank you for your reply! The Inequality constraint works after I removed the timezone part of the datetime columns. I understand that the pre-defined constraints may be more stable, but my original dataset needs more complex constraints to make the synthetic data more meaningful.

npatki · 2023-09-21T15:35:05Z

Great, thanks for confirming the timezone part was the culprit. I have filed #1576 to track timezone-aware datetime columns specifically, so I will close this issue in favor of that.

As for the custom constraints -- we will follow up on #1591

vijayashree-kr · 2023-12-13T19:16:53Z

Hello. I'm encountering the same issue, even when the datetime columns are timezone-unaware. Please suggest a possible workaround.
Dtypes of the data:

number int64
date datetime64[ns]
checkin_utc datetime64[ns]
checkin_local datetime64[ns]
checkout_utc datetime64[ns]
checkout_local datetime64[ns]

npatki · 2023-12-14T15:21:53Z

Hi @vijayashree-kr, could you please file a new issue with what you're observing? This issue is only for timezone-aware columns and has already been fixed. Thanks.

FYI, in order to replicate your problem, it would also be helpful if you could include more information in the new issue that you file:

Your metadata dictionary
The code you are using to create a synthesizer, add constraints etc.
The full stack trace (everything printed out when you see the Error)

wzker11 added the bug and new labels on Sep 6, 2023

npatki added the under discussion label and removed the new label on Sep 8, 2023

npatki mentioned this issue Sep 12, 2023

Constraints should work with timezone-aware datetime columns #1576

Closed

npatki closed this as completed Sep 21, 2023

npatki added the resolution:duplicate label and removed the under discussion label on Sep 21, 2023
