[Blocked] Use Scrub for data cleaning #218

noahho · 2025-02-28T08:24:31Z

Fix #138: NA handling in text columns
Fix #163

Summary

Add skrub>=0.3.0 dependency to handle mixed string/NA data
Integrate TableVectorizer in TabPFNClassifier to properly process text columns with NA values
Add test to verify the solution works as expected

Test plan

Added test_classifier_with_text_and_na that verifies we can fit and predict on a DataFrame with text columns containing NA values
Manually verified with additional use cases not in tests

noahho · 2025-02-28T13:21:55Z

Okay, we encountered a problem: skrub 0.3.0 requires scipy 1.9.3, which isn't compatible with TabPFN.
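
A quick way to see which versions are actually installed when debugging a conflict like this, sketched with the standard library only. The package list is an assumption about what is relevant here; the helper name `installed_versions` is made up for the example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Assumed list of distributions involved in the conflict.
report = installed_versions(["skrub", "scipy", "scikit-learn", "pandas", "tabpfn"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

This only reports what is installed; it does not resolve or validate the version constraints themselves.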

LeoGrin · 2025-03-03T12:17:30Z

Does it fail without _handle_string_na_values? I'm surprised you need it.

LeoGrin · 2025-03-03T17:52:27Z

I've simplified the implementation to rely only on TableVectorizer, without the extra function. I also bumped the scikit-learn minimum version to 1.2.1 for compatibility with skrub. Note that scikit-learn 1.2.1 was released in January 2023, so it is more than two years old and should be a reasonable minimum. The same goes for pandas 1.5.3.
I also removed the drop_null_fraction parameter, since it doesn't exist in all skrub versions. The default behavior is reasonable: it only removes columns that are all NaN, which is appropriate for our use case.
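
The PR simply drops the argument. For comparison, an alternative would be to pass a keyword only when the installed class accepts it. The sketch below shows that idea with `inspect.signature`. `OldVectorizer` and `NewVectorizer` are hypothetical stand-ins, not real skrub classes, and `supported_kwargs` is a made-up helper name.

```python
import inspect


def supported_kwargs(cls, **kwargs):
    """Keep only the keyword arguments that cls's constructor accepts."""
    params = inspect.signature(cls).parameters
    # A constructor that takes **kwargs accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {key: value for key, value in kwargs.items() if key in params}


class OldVectorizer:
    """Hypothetical stand-in for a release without drop_null_fraction."""

    def __init__(self, n_jobs=None):
        self.n_jobs = n_jobs


class NewVectorizer:
    """Hypothetical stand-in for a release that has drop_null_fraction."""

    def __init__(self, n_jobs=None, drop_null_fraction=1.0):
        self.n_jobs = n_jobs
        self.drop_null_fraction = drop_null_fraction


old = OldVectorizer(**supported_kwargs(OldVectorizer, drop_null_fraction=1.0))
new = NewVectorizer(**supported_kwargs(NewVectorizer, drop_null_fraction=1.0))
print(hasattr(old, "drop_null_fraction"), new.drop_null_fraction)
```

Dropping the argument, as done here, is simpler and keeps behavior identical across versions as long as the default is acceptable.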

noahho added 3 commits February 28, 2025 09:24

Refactor string NA handling in utils.py for improved readability

9c1b126

Format code to comply with ruff styling rules

1bbe8f0

Update skrub version to 0.2.0 for compatibility with existing scipy dependency

9007260

noahho changed the title ~~Fix NA handling in text columns~~ [Blocked] Use Scrub for data cleaning Mar 2, 2025

LeoGrin mentioned this pull request Mar 3, 2025

Bump minimal scipy version to 1.11.1 #226

Merged

LeoGrin added 2 commits March 3, 2025 17:46

Simplify text and NA handling using only TableVectorizer, remove _handle_string_na_values

050e953

Bump scikit-learn minimum version to 1.2.1 for compatibility with skrub

d94c2d7

LeoGrin and others added 6 commits March 3, 2025 17:54

Add test case for column with all NaNs

a9c739c

Bump pandas minimum version to 1.5.3 for compatibility with skrub

3910535

Merge branch 'main' into fix-text-na-clean

6d9f024

Fix merge

1c4f1ae

update max skrub

3adc939

fix ruff?

7fb2935
