TabPFNRegressor preprocessing fails on bigger datasets #169

Open
LeoGrin opened this issue Feb 4, 2025 · 4 comments
Labels
bug Something isn't working

Comments

LeoGrin (Collaborator) commented Feb 4, 2025

See https://huggingface.co/Prior-Labs/TabPFN-v2-reg/discussions/2
It seems that QuantileTransformer fails on big datasets with the message "The number of quantiles cannot be greater than the number of samples used", which means TabPFN is unusable for these bigger datasets even with ignore_pretraining_limits=True. Seems to only happen on regression? (not sure)
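For reference, a minimal standalone reproduction of the sklearn error (a sketch, assuming a recent scikit-learn where subsample defaults to 10_000):

import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.random.rand(200_000, 1)
# 200_000 // 10 = 20_000 quantiles exceeds the default subsample of 10_000, so
# fit raises: ValueError: The number of quantiles cannot be greater than the
# number of samples used. ...
QuantileTransformer(n_quantiles=20_000).fit(X)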

@LeoGrin
Copy link
Collaborator Author

LeoGrin commented Feb 4, 2025

In the preprocessing QuantileTransformers, we set the number of quantiles to num_examples // 10 or num_examples // 5, which keeps it lower than the number of samples, but the subsample parameter is unchanged from its default of 10K, which can be lower than the number of quantiles when the number of samples is large. We can either (both options are sketched below):

  • limit the number of quantiles to 10K, or
  • set the subsample really high.
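A minimal sketch of both options (illustrative parameter choices, not TabPFN's actual call sites):

import numpy as np
from sklearn.preprocessing import QuantileTransformer

num_examples = 200_000  # larger than the 10K default subsample
X = np.random.rand(num_examples, 3)

# Option 1: cap the number of quantiles at the 10K default subsample.
qt1 = QuantileTransformer(n_quantiles=min(max(num_examples // 10, 2), 10_000))

# Option 2: raise subsample so it always covers n_quantiles.
qt2 = QuantileTransformer(
    n_quantiles=max(num_examples // 10, 2),
    subsample=max(num_examples, 10_000),
)

for qt in (qt1, qt2):
    qt.fit(X)  # neither raises: n_quantiles <= subsample in both cases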

I quickly checked the time cost of the second option with this test:

import numpy as np
import time
from sklearn.preprocessing import QuantileTransformer

def test_quantile_transformer_speed():
    # Use a dataset with many samples so that default subsampling is active.
    n_samples = 200_000  # more than the default subsample limit (10_000 in recent scikit-learn)
    n_features = 100
    n_quantiles = 10_000
    X = np.random.rand(n_samples, n_features)

    n_runs = 5
    default_times = []
    large_times = []

    for run in range(n_runs):
        print(f"\nRun {run + 1}/{n_runs}")
        
        # Test with default settings
        print("Testing QuantileTransformer with default subsample parameter")
        qt_default = QuantileTransformer(random_state=42, n_quantiles=n_quantiles)
        t0 = time.perf_counter()
        X_trans_default = qt_default.fit_transform(X)
        X_trans_default_2 = qt_default.transform(X)
        t1 = time.perf_counter()
        default_time = t1 - t0
        default_times.append(default_time)
        print(f"Default QuantileTransformer fit_transform time: {default_time:.6f} sec")
        print("Transformed shape:", X_trans_default.shape)

        # Test with subsample explicitly set
        print("\nTesting QuantileTransformer with subsample=100_000")
        qt_large = QuantileTransformer(subsample=100_000, random_state=42, n_quantiles=n_quantiles)
        t0 = time.perf_counter()
        X_trans_large = qt_large.fit_transform(X)
        X_trans_large_2 = qt_large.transform(X)
        t1 = time.perf_counter()
        large_time = t1 - t0
        large_times.append(large_time)
        print(f"QuantileTransformer (subsample=100_000) fit_transform time: {large_time:.6f} sec")
        print("Transformed shape:", X_trans_large.shape)

    # Print summary statistics
    print("\nSummary Statistics:")
    print(f"Default QuantileTransformer:")
    print(f"  Average time: {np.mean(default_times):.6f} sec")
    print(f"  Std dev: {np.std(default_times):.6f} sec")
    print(f"  Times: {[f'{t:.6f}' for t in default_times]}")
    
    print(f"\nQuantileTransformer (subsample=100_000):")
    print(f"  Average time: {np.mean(large_times):.6f} sec")
    print(f"  Std dev: {np.std(large_times):.6f} sec")
    print(f"  Times: {[f'{t:.6f}' for t in large_times]}")

if __name__ == '__main__':
    test_quantile_transformer_speed() 

And got:

Summary Statistics:
Default QuantileTransformer:
  Average time: 7.082734 sec
  Std dev: 0.044457 sec
  Times: ['7.070789', '7.033962', '7.103202', '7.047488', '7.158230']

QuantileTransformer (subsample=100_000):
  Average time: 5.735545 sec
  Std dev: 0.040030 sec
  Times: ['5.678141', '5.717424', '5.721183', '5.772044', '5.788931']

So, surprisingly, increasing the subsample seems to be a bit faster 🤔

(For 1K quantiles I get:

Summary Statistics:
Default QuantileTransformer:
  Average time: 4.122261 sec
  Std dev: 0.044430 sec
  Times: ['4.209649', '4.111263', '4.086600', '4.104299', '4.099495']

QuantileTransformer (subsample=100_000):
  Average time: 4.494478 sec
  Std dev: 0.045392 sec
  Times: ['4.579021', '4.465794', '4.501846', '4.474663', '4.451065']
)

@noahho would you have an opinion between the two options, and on whether changing this parameter after training might be an issue?

@noahho noahho marked this as a duplicate of #182 Feb 13, 2025
@noahho noahho added the bug Something isn't working label Feb 13, 2025
noahho (Collaborator) commented Feb 13, 2025

When using ignore_pretraining_limits=True in TabPFN, the training data is subsampled (typically to 10,000 samples) before fitting the preprocessing pipeline. Currently, quantile transformers in our pipeline—configured in ReshapeFeatureDistributionsStep.get_all_preprocessors—use the original dataset size (e.g. num_examples // 5 or num_examples) to set parameters like n_quantiles. This mismatch leads to requesting more quantiles than the available number of samples (for example, 1,748,982 quantiles for only 10,000 samples), resulting in a ValueError.

Potential solution:

Location 1: Update call site. In the _set_transformer_and_cat_ix method of ReshapeFeatureDistributionsStep (in tabpfn/model/preprocessing.py), compute the effective sample count based on the subsampled data (e.g. 10,000 if subsampling is applied).

Location 2: Update quantile transformer setup. In ReshapeFeatureDistributionsStep.get_all_preprocessors, replace usage of the original num_examples in the quantile transformer calculations with the effective sample count. This ensures that the n_quantiles parameter is dynamically set to a value that does not exceed the number of training samples available for fitting. A sketch follows.
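A rough sketch of the two changes together (effective_num_examples and SUBSAMPLE_LIMIT are illustrative names, not the actual variables in tabpfn/model/preprocessing.py):

from sklearn.preprocessing import QuantileTransformer

SUBSAMPLE_LIMIT = 10_000   # samples kept when the training data is subsampled
num_examples = 1_748_982   # original dataset size (illustrative)

# Location 1: compute the sample count the preprocessors will actually see.
effective_num_examples = min(num_examples, SUBSAMPLE_LIMIT)

# Location 2: size the quantile transformers from the effective count, so
# n_quantiles can never exceed the number of fitted samples.
qt = QuantileTransformer(n_quantiles=max(effective_num_examples // 10, 2))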

Additional Discussion:
@LeoGrin Changing this parameter post-hoc is not an issue because it is applied during preprocessing. This means that the quantile transformation is determined based on the actual data used for training, ensuring consistency during inference.
For datasets with fewer than 10k samples, the effective sample count remains unchanged, preserving the original quantile configuration.
Importantly, we need to maintain multiple values per quantile bucket; otherwise, the quantile transformer would degenerate to merely ranking the values. This issue is more prominent in regression tasks since we also apply quantile transformation to the target (y) values.
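A quick standalone illustration of that degeneration (plain scikit-learn, not TabPFN code):

import numpy as np
from scipy.stats import rankdata
from sklearn.preprocessing import QuantileTransformer

y = np.random.randn(100, 1)  # unique values, one quantile per sample
u = QuantileTransformer(n_quantiles=100).fit_transform(y).ravel()
ranks = (rankdata(y.ravel()) - 1) / 99  # normalized ranks in [0, 1]
print(np.allclose(u, ranks))  # True: the transform reduces to a rank transform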

Relevant code:

    n_quantiles=max(num_examples // 10, 2),

    all_preprocessors = self.get_all_preprocessors(

We subsample afterwards:

    if config.subsample_ix is not None:

LeoGrin (Collaborator, Author) commented Feb 13, 2025

> Changing this parameter post-hoc is not an issue because it is applied during preprocessing. This means that the quantile transformation is determined based on the actual data used for training, ensuring consistency during inference.

I meant changing after pretraining.

dgedon commented Feb 18, 2025

A simple fix could be to replace the lines below with n_quantiles=min(max(num_examples // 10, 2), 10_000). The quantiles are then estimated from a subsample of 10_000 samples, which might lead to less accurate quantiles than using more samples in some cases, though. A quick sanity check follows the list of lines.

Relevant lines should be:

    n_quantiles=max(num_examples // 10, 2),
    n_quantiles=max(num_examples // 10, 2),
    n_quantiles=max(num_examples // 5, 2),
    n_quantiles=num_examples,
    n_quantiles=num_examples,
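A quick sanity check that the capped values fit without the error (sketch; the dataset size is illustrative):

import numpy as np
from sklearn.preprocessing import QuantileTransformer

num_examples = 1_748_982  # large enough to trigger the original ValueError
for raw in (max(num_examples // 10, 2), max(num_examples // 5, 2), num_examples):
    capped = min(raw, 10_000)
    QuantileTransformer(n_quantiles=capped).fit(np.random.rand(20_000, 1))
    # fits fine: the capped n_quantiles (10_000) never exceeds the default subsample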
