TabPFNRegressor preprocessing fails on bigger datasets #169
Comments
In the preprocessing
I quickly checked the time cost of the second option with this test:
And got:
So, surprisingly, increasing the subsample seems to be a bit faster 🤔 (for 1K quantiles I get
@noahho would you have an opinion on the two options, and on whether changing this parameter after training might be an issue? |
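For illustration, here is a rough sketch of the kind of timing comparison described in the comment above. The commenter's actual test and results are not visible here; the two options are assumed to be (A) capping n_quantiles at a small value versus (B) keeping the dataset-derived n_quantiles and raising QuantileTransformer's subsample, and the data shape is made up:

```python
import time

import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 5))  # illustrative size only


def fit_time(transformer) -> float:
    """Return wall-clock seconds for a single fit_transform on X."""
    start = time.perf_counter()
    transformer.fit_transform(X)
    return time.perf_counter() - start


# Option A (assumed): cap n_quantiles at a small value (e.g. 1K),
# leaving the transformer's default subsample of 10,000 in place.
t_capped = fit_time(QuantileTransformer(n_quantiles=1_000, output_distribution="normal"))

# Option B (assumed): keep the large, dataset-derived n_quantiles and raise
# `subsample` so that n_quantiles <= subsample and the ValueError check passes.
t_subsample = fit_time(
    QuantileTransformer(
        n_quantiles=len(X) // 5,
        subsample=len(X),
        output_distribution="normal",
    )
)

print(f"capped n_quantiles: {t_capped:.2f}s, raised subsample: {t_subsample:.2f}s")
```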
When using ignore_pretraining_limits=True in TabPFN, the training data is subsampled (typically to 10,000 samples) before the preprocessing pipeline is fitted. Currently, the quantile transformers in our pipeline, configured in ReshapeFeatureDistributionsStep.get_all_preprocessors, use the original dataset size (e.g. num_examples // 5 or num_examples) to set parameters like n_quantiles. This mismatch leads to requesting more quantiles than there are samples available (for example, 1,748,982 quantiles for only 10,000 samples), resulting in a ValueError.
Potential solution:
Additional discussion:
Relevant code:
TabPFN/src/tabpfn/model/preprocessing.py Line 727 in 9f208b7
TabPFN/src/tabpfn/model/preprocessing.py Line 867 in 9f208b7
We subsample afterwards: TabPFN/src/tabpfn/preprocessing.py Line 563 in 9f208b7
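For reference, a minimal sketch of the mismatch using scikit-learn's QuantileTransformer directly. The sizes below are illustrative, not the 1,748,982 / 10,000 figures from the issue:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

num_examples = 100_000  # size of the full training set (illustrative)
X = np.random.default_rng(0).normal(size=(num_examples, 3))

# n_quantiles is derived from the full dataset size, but QuantileTransformer
# rejects any n_quantiles larger than its own `subsample` (10,000 by default).
qt = QuantileTransformer(n_quantiles=num_examples // 5, output_distribution="normal")
qt.fit(X)
# ValueError: The number of quantiles cannot be greater than the number of
# samples used. Got 20000 quantiles and 10000 samples.
```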
|
I meant changing it after pretraining. |
A simple fix could be to replace the lines below (one possible shape of such a clamp is sketched after the references). Relevant lines should be:
TabPFN/src/tabpfn/model/preprocessing.py Line 722 in 9f208b7
TabPFN/src/tabpfn/model/preprocessing.py Line 727 in 9f208b7
TabPFN/src/tabpfn/model/preprocessing.py Line 737 in 9f208b7
TabPFN/src/tabpfn/model/preprocessing.py Line 742 in 9f208b7
TabPFN/src/tabpfn/model/preprocessing.py Line 747 in 9f208b7
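For illustration only: this is my own sketch of such a clamp, not necessarily the replacement the commenter had in mind. SUBSAMPLE_LIMIT, make_quantile_transformer, and the output distribution are hypothetical names and choices:

```python
from sklearn.preprocessing import QuantileTransformer

# Assumed subsampling size applied when ignore_pretraining_limits=True.
SUBSAMPLE_LIMIT = 10_000


def make_quantile_transformer(num_examples: int, random_state=None) -> QuantileTransformer:
    """Build a quantile transformer whose n_quantiles cannot exceed the number
    of rows it will actually be fitted on (hypothetical helper)."""
    # Previously something like `num_examples // 5` could exceed the subsampled
    # row count; clamping keeps the parameter valid for any dataset size.
    n_quantiles = max(10, min(num_examples // 5, SUBSAMPLE_LIMIT))
    return QuantileTransformer(
        n_quantiles=n_quantiles,
        output_distribution="normal",
        random_state=random_state,
    )
```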
|
See https://huggingface.co/Prior-Labs/TabPFN-v2-reg/discussions/2
It seems that QuantileTransformer fails on big datasets with the message "The number of quantiles cannot be greater than the number of samples used", which means TabPFN is unusable for these bigger datasets even with ignore_pretraining_limits=True. This seems to only happen on regression? (not sure)
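A minimal sketch of how this surfaces at the user level, assuming the TabPFNRegressor API and the ignore_pretraining_limits flag mentioned above; the dataset shape is made up:

```python
import numpy as np
from tabpfn import TabPFNRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 10))  # well above the pretraining size limit
y = X @ rng.normal(size=10) + rng.normal(size=1_000_000)

reg = TabPFNRegressor(ignore_pretraining_limits=True)
reg.fit(X, y)  # currently fails in preprocessing with the QuantileTransformer ValueError
```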