Unable to conditionally sample some rows when using a `ScalarRange` constraint #1737

MarcJohler · 2024-01-11T10:38:45Z

Environment Details

SDV version: 1.8.0
Python version: 3.10.13
Operating System: Windows 11 Home

Error Description

In sdv/constraints/tabular.py, in ScalarRange._reverse_transform, line 1187:

table_data[self._column_name] = data.astype(self._dtype)

The typecast leads to unintended behavior: decimals are pruned when the column is cast back to an integer dtype.
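To illustrate the pruning, here is a minimal sketch in plain Python rather than SDV's actual code. It assumes the reverse-transformed value carries a small floating-point error (the sample values are made up) and uses int(), which truncates toward zero the same way a float-to-integer astype does. The helper names are hypothetical.

```python
def truncating_cast(values):
    """Mimic an integer typecast: drop the fractional part (truncate toward zero)."""
    return [int(v) for v in values]


def rounding_cast(values):
    """Round to the nearest integer before casting."""
    return [int(round(v)) for v in values]


# Made-up reverse-transformed values that should all represent 1
# but carry a tiny floating-point error.
reversed_values = [0.9999999999, 1.0, 1.0000000001]

print(truncating_cast(reversed_values))  # [0, 1, 1]
print(rounding_cast(reversed_values))    # [1, 1, 1]
```

If the value lands slightly below the intended integer, the plain cast turns a conditioned 1 into 0, whereas rounding first would keep it at 1.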

Steps to reproduce

I discovered the bug while using a Gaussian Copula model that contains an integer variable with valid values from 0 to 10. I applied conditional sampling with the .sample_remaining_columns method for one row with that variable set to 1. However, the sample is rejected later in BaseSingleTableSynthesizer._sample_rows (sdv/single_table/base.py): it is filtered out at line 584, sampled = self._data_processor.filter_valid(sampled), because the generated sample no longer matches the conditions provided to .sample_remaining_columns.
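The rejection step can be pictured with a small stand-in. This is only a sketch of the idea, not SDV's filtering code: filter_matching is a hypothetical helper, and the sampled rows are invented to show what happens once the conditioned value has been cast to 0.

```python
def filter_matching(rows, conditions):
    """Keep only the rows whose values equal every requested condition."""
    return [
        row for row in rows
        if all(row.get(col) == val for col, val in conditions.items())
    ]


conditions = {'B': 1}

# Invented sampled rows: the conditioned value 1 came back as 0 after the cast.
sampled = [
    {'A': 4.21, 'B': 0},
    {'A': 7.03, 'B': 0},
]

valid = filter_matching(sampled, conditions)
print(len(valid))  # 0
```

With every row rejected, nothing is left to return, which is consistent with the sampler reporting that it cannot produce rows for the given conditions.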

Unfortunately, I cannot provide a minimal reproducible example due to lack of time. I hope that the above information is sufficient to reproduce the behavior. Otherwise, please comment for clarification.

npatki · 2024-01-17T02:59:47Z

Hi @MarcJohler, thanks for filing the issue with the details and providing some insight into what's going on. We'll keep this issue open as we track the fix. Fortunately, there are a few workarounds, which I'll mention below.

Reproducing the Issue

The code below reproduces the issue. Do let us know if you meant something else by your original issue.

import pandas as pd
import numpy as np

from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.sampling import Condition

data = pd.DataFrame(data={
    'A': [round(i, 2) for i in np.random.uniform(low=0, high=10, size=100)],
    'B': [round(i) for i in np.random.uniform(low=0, high=10, size=100)],
    'C': np.random.choice(['Yes', 'No', 'Maybe'], size=100)
})

metadata = SingleTableMetadata.load_from_dict({
    'columns': {
        'A': { 'sdtype': 'numerical' },
        'B': { 'sdtype': 'numerical' },
        'C': { 'sdtype': 'categorical' }
    }
})

synth = GaussianCopulaSynthesizer(metadata)

constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'B',
        'low_value': 0,
        'high_value': 10,
        'strict_boundaries': False
    }
}

synth.add_constraints([constraint])
synth.fit(data)

my_condition = Condition(num_rows=250, column_values={'B': 1})
synth.sample_from_conditions([my_condition])

Output:

ValueError: Unable to sample any rows for the given conditions. This may be because the provided values are out-of-bounds in the current model. 
Please try again with a different set of values.

Workaround 1: Preprocessing

By default, SDV synthesizers are automatically configured to enforce the observed min/max values for all columns. So there's no need to add a ScalarRange constraint.

Alternatively, you can toggle this on/off for particular columns by updating the data transformers.

from rdt.transformers.numerical import FloatFormatter

# enforce for all columns (default)
synth = GaussianCopulaSynthesizer(metadata, enforce_min_max_values=True)

# selectively enforce
synth = GaussianCopulaSynthesizer(metadata, enforce_min_max_values=False)
synth.auto_assign_transformers(data)

synth.update_transformers({
    # enforce the observed min/max for column B only
    'B': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True)
})

synth.fit(data)

Workaround 2: Update the bounds of ScalarRange

For this particular case, updating the lower boundary to -1 seemed to work for me. I'm not entirely sure why.

constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'B',
        'low_value': -1,
        'high_value': 10,
        'strict_boundaries': False
    }
}

MarcJohler added the bug and new labels Jan 11, 2024

MarcJohler changed the title ~~constraints.tabular.ScalarRange._reverse_transform: unintended decimal pruning in integer typecase~~ constraints.tabular.ScalarRange._reverse_transform: unintended decimal pruning in integer typecast Jan 11, 2024

npatki added the feature:constraints and under discussion labels and removed the new label Jan 17, 2024

npatki changed the title ~~constraints.tabular.ScalarRange._reverse_transform: unintended decimal pruning in integer typecast~~ Unable to conditionally sample some rows when using a ScalarRange constraint Jan 17, 2024

npatki removed the under discussion label Jan 29, 2024

fealho mentioned this issue Feb 7, 2024

Fix conditional sampling with constraints #1783

Merged

fealho closed this as completed in #1783 Feb 12, 2024

amontanez24 assigned fealho Feb 14, 2024

amontanez24 added this to the 1.10.0 milestone Feb 14, 2024
