Unable to conditionally sample some rows when using a ScalarRange constraint #1737

Closed
MarcJohler opened this issue Jan 11, 2024 · 1 comment · Fixed by #1783
Labels: bug (Something isn't working), feature:constraints (Related to inputting rules or business logic)


MarcJohler commented Jan 11, 2024

Environment Details

  • SDV version: 1.8.0
  • Python version: 3.10.13
  • Operating System: Windows 11 Home

Error Description

In sdv/constraints/tabular.py, in ScalarRange._reverse_transform (line 1187):

table_data[self._column_name] = data.astype(self._dtype)

When the target dtype is an integer, the typecast drops the decimal part instead of rounding, which leads to unintended behavior like this:
[Image: typecast behaviour]
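
The truncation can be illustrated with a minimal pandas sketch. The input values are hypothetical: they stand in for a reverse-transformed integer column that carries a tiny floating-point error, and they are not taken from SDV's output.

```python
import pandas as pd

# Hypothetical values standing in for a reverse-transformed integer column:
# floating-point error leaves some of them just below the intended integer.
reversed_values = pd.Series([0.9999999, 1.0000001, 9.9999999])

# Casting a float column straight to an integer dtype drops the decimals.
truncated = reversed_values.astype('int64')

# Rounding to the nearest integer before the cast keeps the intended values.
rounded = reversed_values.round().astype('int64')

print(truncated.tolist())
print(rounded.tolist())
```

With a direct cast, a value that should have been 1 becomes 0, so it no longer matches a condition such as B == 1.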

Steps to reproduce

I discovered the bug while using a Gaussian Copula model containing an integer variable with valid values from 0 to 10. I applied conditional sampling with the .sample_remaining_columns method for one row with that variable set to 1. However, the sample is later rejected in BaseSingleTableSynthesizer._sample_rows (sdv/single_table/base.py): line 584, sampled = self._data_processor.filter_valid(sampled), filters it out because the generated sample no longer matches the conditions provided to .sample_remaining_columns.
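
The rejection described above can be sketched roughly as follows. This is not SDV's code: keep_rows_matching is a hypothetical helper and the sampled values are made up. It only shows how a truncating cast can leave no rows that satisfy the requested condition.

```python
import pandas as pd

def keep_rows_matching(sampled, conditions):
    """Hypothetical helper: keep only rows whose values equal every condition."""
    mask = pd.Series(True, index=sampled.index)
    for column, value in conditions.items():
        mask &= sampled[column] == value
    return sampled[mask]

# Made-up rows that were meant to have B == 1 before the integer cast.
sampled = pd.DataFrame({
    'A': [2.5, 7.1, 4.4],
    'B': [0.9999999, 0.9999998, 0.9999999],
})

# The truncating cast turns every B value into 0.
sampled['B'] = sampled['B'].astype('int64')

valid = keep_rows_matching(sampled, {'B': 1})
print(len(valid))
```

Because every candidate row is discarded, a sampler that retries until it has enough valid rows would eventually give up and report that it could not sample any rows for the condition.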

Unfortunately, I cannot provide a minimal reproducible example due to lack of time. I hope that the above information is sufficient to reproduce the behavior. Otherwise, please comment for clarification.

MarcJohler added the bug (Something isn't working) and new (Automatic label applied to new issues) labels Jan 11, 2024
MarcJohler changed the title from "constraints.tabular.ScalarRange._reverse_transform: unintended decimal pruning in integer typecase" to "constraints.tabular.ScalarRange._reverse_transform: unintended decimal pruning in integer typecast" Jan 11, 2024
npatki (Contributor) commented Jan 17, 2024

Hi @MarcJohler, thanks for filing the issue with the details and providing some insight as to what's going on. We'll keep this issue open as we track the fix. Fortunately, there are a few workarounds, which I'll mention below.

Reproducing the Issue

The code below reproduces the issue. Do let us know if you meant something else by your original issue.

import pandas as pd
import numpy as np

from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.sampling import Condition

data = pd.DataFrame(data={
    'A': [round(i, 2) for i in np.random.uniform(low=0, high=10, size=100)],
    'B': [round(i) for i in np.random.uniform(low=0, high=10, size=100)],
    'C': np.random.choice(['Yes', 'No', 'Maybe'], size=100)
})

metadata = SingleTableMetadata.load_from_dict({
    'columns': {
        'A': { 'sdtype': 'numerical' },
        'B': { 'sdtype': 'numerical' },
        'C': { 'sdtype': 'categorical' }
    }
})

synth = GaussianCopulaSynthesizer(metadata)

constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'B',
        'low_value': 0,
        'high_value': 10,
        'strict_boundaries': False
    }
}

synth.add_constraints([constraint])
synth.fit(data)

my_condition = Condition(num_rows=250, column_values={'B': 1})
synth.sample_from_conditions([my_condition])

Output:

ValueError: Unable to sample any rows for the given conditions. This may be because the provided values are out-of-bounds in the current model. 
Please try again with a different set of values.

Workaround 1: Preprocessing

By default, SDV synthesizers are automatically configured to enforce the observed min/max values for all columns. So there's no need to add a ScalarRange constraint.

Alternatively, you can toggle this on/off for particular columns by updating the data transformers.

from rdt.transformers.numerical import FloatFormatter

# enforce for all columns (default)
synth = GaussianCopulaSynthesizer(metadata, enforce_min_max_values=True)

# selectively enforce: turn enforcement off globally, then enable it per column
synth = GaussianCopulaSynthesizer(metadata, enforce_min_max_values=False)
synth.auto_assign_transformers(data)

synth.update_transformers({
    'B': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True)
})

synth.fit(data)

Workaround 2: Update the bounds of ScalarRange

For this particular case, updating the lower boundary to -1 seemed to work for me. I'm not entirely sure why.

constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'B',
        'low_value': -1,
        'high_value': 10,
        'strict_boundaries': False
    }
}
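
One possible explanation, offered only as a sketch: if the constraint maps the column through a scaled logit and back through a sigmoid, the floating-point error of that round trip depends on the bounds, so different bounds can push a given integer slightly above or slightly below its exact value. The functions below are assumptions written for illustration (including the 0.95 and 0.025 scaling constants) and are not copied from SDV.

```python
import numpy as np

def to_unbounded(x, low, high):
    """Assumed forward transform: scale into (0.025, 0.975), then apply logit."""
    scaled = (x - low) / (high - low) * 0.95 + 0.025
    return np.log(scaled / (1.0 - scaled))

def to_bounded(z, low, high):
    """Assumed reverse transform: sigmoid, then undo the scaling."""
    scaled = 1.0 / (1.0 + np.exp(-z))
    return (scaled - 0.025) / 0.95 * (high - low) + low

def round_trip(low, high):
    """Round-trip the integers 0..10 and cast back with and without rounding."""
    values = np.arange(0, 11, dtype=float)
    restored = to_bounded(to_unbounded(values, low, high), low, high)
    truncated = restored.astype('int64')          # direct cast, drops decimals
    rounded = np.round(restored).astype('int64')  # round first, then cast
    return values, truncated, rounded

for low in (0, -1):
    values, truncated, rounded = round_trip(low, 10)
    lost = values[truncated != values].astype(int).tolist()
    print(f'low_value={low}: integers changed by truncation: {lost}')
```

Which integers are affected, if any, depends on the platform's floating-point arithmetic, so the printed lists are not guaranteed. Rounding before the cast recovers every value in this sketch regardless of the bounds.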

npatki added the feature:constraints (Related to inputting rules or business logic) and under discussion (Issue is currently being discussed) labels and removed the new (Automatic label applied to new issues) label Jan 17, 2024
npatki changed the title from "constraints.tabular.ScalarRange._reverse_transform: unintended decimal pruning in integer typecast" to "Unable to conditionally sample some rows when using a ScalarRange constraint" Jan 17, 2024
npatki removed the under discussion (Issue is currently being discussed) label Jan 29, 2024
amontanez24 added this to the 1.10.0 milestone Feb 14, 2024