# Unable to conditionally sample some rows when using a ScalarRange constraint #1737

## Comments
Hi @MarcJohler, thanks for filing the issue with the details and providing some insight as to what's going on. We'll keep this issue open as we track the fix. Fortunately, there are a few workarounds, which I'll mention below.

### Reproducing the Issue

The code below reproduces the issue. Do let us know if you meant something else by your original issue.

```python
import pandas as pd
import numpy as np
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.sampling import Condition

data = pd.DataFrame(data={
    'A': [round(i, 2) for i in np.random.uniform(low=0, high=10, size=100)],
    'B': [round(i) for i in np.random.uniform(low=0, high=10, size=100)],
    'C': np.random.choice(['Yes', 'No', 'Maybe'], size=100)
})

metadata = SingleTableMetadata.load_from_dict({
    'columns': {
        'A': {'sdtype': 'numerical'},
        'B': {'sdtype': 'numerical'},
        'C': {'sdtype': 'categorical'}
    }
})

synth = GaussianCopulaSynthesizer(metadata)

constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'B',
        'low_value': 0,
        'high_value': 10,
        'strict_boundaries': False
    }
}
synth.add_constraints([constraint])
synth.fit(data)

my_condition = Condition(num_rows=250, column_values={'B': 1})
synth.sample_from_conditions([my_condition])
```

Output:
### Workaround 1: Preprocessing

By default, SDV synthesizers are automatically configured to enforce the observed min/max values for all columns, so there's no need to add a `ScalarRange` constraint for this purpose. Alternatively, you can toggle this on/off for particular columns by updating the data transformers.

```python
from rdt.transformers.numerical import FloatFormatter

# enforce for all columns (default)
synth = GaussianCopulaSynthesizer(metadata, enforce_min_max_values=True)

# selectively enforce
synth = GaussianCopulaSynthesizer(metadata, enforce_min_max_values=False)
synth.auto_assign_transformers(data)
synth.update_transformers({
    'B': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=False)
})
synth.fit(data)
```

### Workaround 2: Update the bounds of ScalarRange

For this particular case, updating the lower boundary to -1 seemed to work for me. I'm not entirely sure why.

```python
constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'B',
        'low_value': -1,
        'high_value': 10,
        'strict_boundaries': False
    }
}
```
## Environment Details

## Error Description
In `sdv/constraints/tabular.py`, in `ScalarRange._reverse_transform` at line 1187, the typecast will lead to unintended behavior like this:
*(attached screenshot: typecast behaviour)*
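The effect of such a cast can be illustrated in isolation with plain pandas. This is a minimal sketch of the dtype behaviour only, not SDV's actual reverse-transform code:

```python
import pandas as pd

# Float values as they might come out of the model's reverse transform;
# the original column dtype is int64.
reverse_transformed = pd.Series([0.9999, 1.0001, 1.5], name='B')

# astype truncates toward zero rather than rounding, so a value that is
# numerically "almost 1" becomes 0 after the cast.
cast = reverse_transformed.astype('int64')
print(cast.tolist())  # [0, 1, 1]
```

A sample intended to satisfy the condition `B == 1` can therefore end up with `B == 0` after the cast, which is what later trips the validity filter.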
## Steps to reproduce
I discovered the bug while using a Gaussian Copula model containing an integer variable with valid values from 0 to 10. I applied conditional sampling with the `.sample_remaining_columns` method for one row with that variable set to 1. However, the sample is rejected later in `sdv/single_table/base.py`, in `BaseSingleTableSynthesizer._sample_rows`, because it is filtered out in line 584:

```python
sampled = self._data_processor.filter_valid(sampled)
```

This happens because the generated sample doesn't coincide with the conditions provided to `.sample_remaining_columns`. Unfortunately, I cannot provide a minimal reproducible example due to lack of time. I hope that the above information is sufficient to reproduce the behavior; otherwise, please comment for clarification.
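The filtering step described above can be sketched with a small stand-in. This is hypothetical data and a simplified filter, assuming the validity check drops rows that no longer match the requested condition:

```python
import pandas as pd

# Hypothetical batch sampled for the condition B == 1: the constraint's
# reverse transform has truncated some values down to 0.
sampled = pd.DataFrame({'B': [0, 1, 0, 0]})

# Simplified stand-in for the validity filter: rows that no longer match
# the requested condition are dropped before being returned to the caller.
valid = sampled[sampled['B'] == 1]
print(len(valid))  # only 1 of 4 rows survives
```

With enough values truncated out of range, every row in a batch can be rejected, so the requested number of conditioned rows is never reached.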