TabPFNRegressor preprocessing fails on bigger datasets #169

Open
LeoGrin opened this issue Feb 4, 2025 · 4 comments
Labels
bug Something isn't working

Comments

LeoGrin (Collaborator) commented Feb 4, 2025

See https://huggingface.co/Prior-Labs/TabPFN-v2-reg/discussions/2
It seems that QuantileTransformer fails on big datasets with the message "The number of quantiles cannot be greater than the number of samples used", which means TabPFN is unusable for these bigger datasets even with ignore_pretraining_limits=True. Seems to only happen on regression? (not sure)
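For reference, a minimal standalone reproduction of the sklearn error (a sketch, assuming a recent scikit-learn where subsample defaults to 10_000):

import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.random.rand(200_000, 1)
# 200_000 // 10 = 20_000 quantiles exceeds the default subsample of 10_000, so
# fit raises: ValueError: The number of quantiles cannot be greater than the
# number of samples used. ...
QuantileTransformer(n_quantiles=20_000).fit(X)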

@LeoGrin
Copy link
Collaborator Author

LeoGrin commented Feb 4, 2025

In the preprocessing QuantileTransformers, we set the number of quantiles to num_examples // 10 or num_examples // 5, which keeps it lower than the number of samples, but the subsample parameter is unchanged from its default of 10K, which can be lower than the number of quantiles when the number of samples is large. We can either (both options are sketched below):

  • limit the number of quantiles to 10K, or
  • set the subsample really high.
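A minimal sketch of both options (illustrative parameter choices, not TabPFN's actual call sites):

import numpy as np
from sklearn.preprocessing import QuantileTransformer

num_examples = 200_000  # larger than the 10K default subsample
X = np.random.rand(num_examples, 3)

# Option 1: cap the number of quantiles at the 10K default subsample.
qt1 = QuantileTransformer(n_quantiles=min(max(num_examples // 10, 2), 10_000))

# Option 2: raise subsample so it always covers n_quantiles.
qt2 = QuantileTransformer(
    n_quantiles=max(num_examples // 10, 2),
    subsample=max(num_examples, 10_000),
)

for qt in (qt1, qt2):
    qt.fit(X)  # neither raises: n_quantiles <= subsample in both cases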

I quickly checked the time cost of the second option with this test:

import numpy as np
import time
from sklearn.preprocessing import QuantileTransformer

def test_quantile_transformer_speed():
    # Use a dataset with many samples so that default subsampling is active.
    n_samples = 200_000  # more than the default subsample limit (10_000 in recent scikit-learn)
    n_features = 100
    n_quantiles = 10_000
    X = np.random.rand(n_samples, n_features)

    n_runs = 5
    default_times = []
    large_times = []

    for run in range(n_runs):
        print(f"\nRun {run + 1}/{n_runs}")
        
        # Test with default settings
        print("Testing QuantileTransformer with default subsample parameter")
        qt_default = QuantileTransformer(random_state=42, n_quantiles=n_quantiles)
        t0 = time.perf_counter()
        X_trans_default = qt_default.fit_transform(X)
        X_trans_default_2 = qt_default.transform(X)
        t1 = time.perf_counter()
        default_time = t1 - t0
        default_times.append(default_time)
        print(f"Default QuantileTransformer fit_transform time: {default_time:.6f} sec")
        print("Transformed shape:", X_trans_default.shape)

        # Test with subsample explicitly set
        print("\nTesting QuantileTransformer with subsample=100_000")
        qt_large = QuantileTransformer(subsample=100_000, random_state=42, n_quantiles=n_quantiles)
        t0 = time.perf_counter()
        X_trans_large = qt_large.fit_transform(X)
        X_trans_large_2 = qt_large.transform(X)
        t1 = time.perf_counter()
        large_time = t1 - t0
        large_times.append(large_time)
        print(f"QuantileTransformer (subsample=100_000) fit_transform time: {large_time:.6f} sec")
        print("Transformed shape:", X_trans_large.shape)

    # Print summary statistics
    print("\nSummary Statistics:")
    print(f"Default QuantileTransformer:")
    print(f"  Average time: {np.mean(default_times):.6f} sec")
    print(f"  Std dev: {np.std(default_times):.6f} sec")
    print(f"  Times: {[f'{t:.6f}' for t in default_times]}")
    
    print(f"\nQuantileTransformer (subsample=100_000):")
    print(f"  Average time: {np.mean(large_times):.6f} sec")
    print(f"  Std dev: {np.std(large_times):.6f} sec")
    print(f"  Times: {[f'{t:.6f}' for t in large_times]}")

if __name__ == '__main__':
    test_quantile_transformer_speed() 

And got:

Summary Statistics:
Default QuantileTransformer:
  Average time: 7.082734 sec
  Std dev: 0.044457 sec
  Times: ['7.070789', '7.033962', '7.103202', '7.047488', '7.158230']

QuantileTransformer (subsample=100_000):
  Average time: 5.735545 sec
  Std dev: 0.040030 sec
  Times: ['5.678141', '5.717424', '5.721183', '5.772044', '5.788931']

So, surprisingly, increasing the subsample seems to be a bit faster 🤔

(For 1K quantiles I get:

Summary Statistics:
Default QuantileTransformer:
  Average time: 4.122261 sec
  Std dev: 0.044430 sec
  Times: ['4.209649', '4.111263', '4.086600', '4.104299', '4.099495']

QuantileTransformer (subsample=100_000):
  Average time: 4.494478 sec
  Std dev: 0.045392 sec
  Times: ['4.579021', '4.465794', '4.501846', '4.474663', '4.451065']
)

@noahho would you have an opinion between the two options, and on whether changing this parameter after training might be an issue?

@noahho noahho marked this as a duplicate of #182 Feb 13, 2025
@noahho noahho added the bug Something isn't working label Feb 13, 2025
noahho (Collaborator) commented Feb 13, 2025

When using ignore_pretraining_limits=True in TabPFN, the training data is subsampled (typically to 10,000 samples) before fitting the preprocessing pipeline. Currently, quantile transformers in our pipeline—configured in ReshapeFeatureDistributionsStep.get_all_preprocessors—use the original dataset size (e.g. num_examples // 5 or num_examples) to set parameters like n_quantiles. This mismatch leads to requesting more quantiles than the available number of samples (for example, 1,748,982 quantiles for only 10,000 samples), resulting in a ValueError.

Potential solution:

Location 1: Update call site. In the _set_transformer_and_cat_ix method of ReshapeFeatureDistributionsStep (in tabpfn/model/preprocessing.py), compute the effective sample count based on the subsampled data (e.g. 10,000 if subsampling is applied).

Location 2: Update quantile transformer setup. In ReshapeFeatureDistributionsStep.get_all_preprocessors, replace usage of the original num_examples in the quantile transformer calculations with the effective sample count. This ensures that the n_quantiles parameter is dynamically set to a value that does not exceed the number of training samples available for fitting. A sketch follows.
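A rough sketch of the two changes together (effective_num_examples and SUBSAMPLE_LIMIT are illustrative names, not the actual variables in tabpfn/model/preprocessing.py):

from sklearn.preprocessing import QuantileTransformer

SUBSAMPLE_LIMIT = 10_000   # samples kept when the training data is subsampled
num_examples = 1_748_982   # original dataset size (illustrative)

# Location 1: compute the sample count the preprocessors will actually see.
effective_num_examples = min(num_examples, SUBSAMPLE_LIMIT)

# Location 2: size the quantile transformers from the effective count, so
# n_quantiles can never exceed the number of fitted samples.
qt = QuantileTransformer(n_quantiles=max(effective_num_examples // 10, 2))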

Additional Discussion:
@LeoGrin Changing this parameter post-hoc is not an issue because it is applied during preprocessing. This means that the quantile transformation is determined based on the actual data used for training, ensuring consistency during inference.
For datasets with fewer than 10k samples, the effective sample count remains unchanged, preserving the original quantile configuration.
Importantly, we need to maintain multiple values per quantile bucket; otherwise, the quantile transformer would degenerate to merely ranking the values. This issue is more prominent in regression tasks since we also apply quantile transformation to the target (y) values.
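A quick standalone illustration of that degeneration (plain scikit-learn, not TabPFN code):

import numpy as np
from scipy.stats import rankdata
from sklearn.preprocessing import QuantileTransformer

y = np.random.randn(100, 1)  # unique values, one quantile per sample
u = QuantileTransformer(n_quantiles=100).fit_transform(y).ravel()
ranks = (rankdata(y.ravel()) - 1) / 99  # normalized ranks in [0, 1]
print(np.allclose(u, ranks))  # True: the transform reduces to a rank transform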

Relevant code:

    n_quantiles=max(num_examples // 10, 2),

    all_preprocessors = self.get_all_preprocessors(

We subsample afterwards:

    if config.subsample_ix is not None:

LeoGrin (Collaborator, Author) commented Feb 13, 2025

> Changing this parameter post-hoc is not an issue because it is applied during preprocessing. This means that the quantile transformation is determined based on the actual data used for training, ensuring consistency during inference.

I meant changing after pretraining.

dgedon commented Feb 18, 2025

A simple fix could be to replace the lines below with n_quantiles=min(max(num_examples // 10, 2), 10_000). The quantiles are then estimated from a subsample of 10_000 samples, which might lead to less accurate quantiles than using more samples in some cases, though. A quick sanity check follows the list of lines.

Relevant lines should be:

    n_quantiles=max(num_examples // 10, 2),
    n_quantiles=max(num_examples // 10, 2),
    n_quantiles=max(num_examples // 5, 2),
    n_quantiles=num_examples,
    n_quantiles=num_examples,
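A quick sanity check that the capped values fit without the error (sketch; the dataset size is illustrative):

import numpy as np
from sklearn.preprocessing import QuantileTransformer

num_examples = 1_748_982  # large enough to trigger the original ValueError
for raw in (max(num_examples // 10, 2), max(num_examples // 5, 2), num_examples):
    capped = min(raw, 10_000)
    QuantileTransformer(n_quantiles=capped).fit(np.random.rand(20_000, 1))
    # fits fine: the capped n_quantiles (10_000) never exceeds the default subsample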
