[Blocked] Use Scrub for data cleaning #218

Open · wants to merge 12 commits into base: main


noahho (Collaborator) commented Feb 28, 2025

Fix #138: NA handling in text columns
Fix #163

Summary

  • Add skrub>=0.3.0 dependency to handle mixed string/NA data
  • Integrate TableVectorizer in TabPFNClassifier to properly process text columns with NA values
  • Add test to verify the solution works as expected

Test plan

  • Added test_classifier_with_text_and_na that verifies we can fit and predict on a DataFrame with text columns containing NA values
  • Manually verified with additional use cases not in tests

noahho (Collaborator, Author) commented Feb 28, 2025

Okay, we ran into a problem: skrub 0.3.0 requires scipy 1.9.3, which isn't compatible with TabPFN.

noahho changed the title from "Fix NA handling in text columns" to "[Blocked] Use Scrub for data cleaning" on Mar 2, 2025
LeoGrin (Collaborator) commented Mar 3, 2025

Does it fail without _handle_string_na_values? I'm surprised you need it.

LeoGrin (Collaborator) commented Mar 3, 2025

I've simplified the implementation to rely only on TableVectorizer, without the extra function. I also bumped the minimum scikit-learn version to 1.2.1 for compatibility with skrub. scikit-learn 1.2.1 was released in January 2023, so it is more than 2 years old and should be a reasonable minimum. The same goes for pandas 1.5.3.
I also removed the drop_null_fraction parameter, since it doesn't exist in all skrub versions. The default behavior is reasonable: it only removes columns that are entirely NaN, which is appropriate for our use case.

Successfully merging this pull request may close these issues.

  • Support datetime and bool dtypes
  • TabPFN fails on text with NA