Remove workarounds from Catboost incompatibility with string categories #4051

Open
tamargrey opened this issue Mar 6, 2023 · 3 comments

tamargrey commented Mar 6, 2023

Catboost currently has an incompatibility with columns that have the category dtype and string categories (catboost/catboost#1965). Because of this, we have two workarounds in place that we should remove once that issue is resolved.

  1. In the imputer refactor for handling nullable types, we stopped being able to recognize boolean categorical columns, which allowed us to use the logical types from the Email and URL primitives. This surfaced the catboost incompatibility at _ExtractFeaturesWithTransformPrimitives, so we convert to object dtype (here) and reinitialize woodwork to change the string categories produced by the primitives to object categories that catboost can handle.
  2. In the Catboost estimators, we have to handle float categories as part of a different catboost requirement. We currently handle this in a clunky way, but converting the categories to string would likely be a much simpler solution. We should also use .apply for that change, as noted in #3973 ("Use .apply to change categories' dtype in handle_float_categories_for_catboost"), which would be a cleaner way to do it.
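
The float-category conversion described in item 2 can be illustrated with plain pandas. This is a minimal sketch, not the evalml implementation: the helper name `float_categories_to_str` and the sample data are hypothetical, and it uses pandas' `rename_categories` rather than the `.apply` approach from #3973. Whether catboost accepts the resulting string categories depends on the catboost version, per catboost/catboost#1965.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def float_categories_to_str(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X where float-valued categories are renamed to strings.

    Hypothetical helper for illustration; not part of evalml or catboost.
    """
    X = X.copy()
    for col in X.columns:
        dtype = X[col].dtype
        # Only touch categorical columns whose categories are floats.
        if isinstance(dtype, pd.CategoricalDtype) and is_float_dtype(
            dtype.categories.dtype
        ):
            # rename_categories accepts a callable applied to each category.
            X[col] = X[col].cat.rename_categories(str)
    return X


# Sample data (assumed for illustration): one float-category column,
# one column that already has string categories.
X = pd.DataFrame(
    {
        "float_cat": pd.Series([1.0, 2.5, 1.0], dtype="category"),
        "str_cat": pd.Series(["a", "b", "a"], dtype="category"),
    }
)
converted = float_categories_to_str(X)
print(converted.dtypes)
print(list(converted["float_cat"].cat.categories))
```

The column keeps the category dtype; only the category labels change, so the codes and any missing values are left as they were.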

tamargrey commented

The string categories problem seems to have been fixed. The fix was added to catboost via catboost/catboost#2096, and we're already on catboost 1.1.1, so we should confirm that the problem with string categories as we ran into it no longer exists and remove the workarounds put in place to avoid it.

We should check whether support for float categories was added and, if so, remove all of that handling. Otherwise, just use the string conversion. We should consider #3973 when implementing this, as .apply may still be the nicest way to make the category conversion.

tamargrey commented Apr 26, 2023

I realized we were already on catboost 1.1.1 when I made this ticket, so I looked into more specifics. The bugfix seems to have covered data with the string dtype but, in some cases, not string categories in category data (I don't know exactly what the difference is).

Things I've learned

  • For the change to handle_float_categories_for_catboost, it seems like we could use apply(str) (which actually turns the categories into the object dtype) or apply(int) (which keeps the current behavior) rather than astype(str).astype("category"), as the latter retains the original string bug but apply doesn't.
  • The test that triggers the catboost failure when combined with the email featurizer is test_get_component_input_logical_types.
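
To check how the two conversions from the first bullet behave on a given pandas version, a small inspection sketch like the one below may help. It only reports dtypes and does not call catboost. The helper name `describe_conversion` and the sample data are assumptions for illustration. The result of `apply(str)` on a categorical column can differ between pandas versions, so no expected output is shown.

```python
import pandas as pd


def describe_conversion(series: pd.Series) -> dict:
    """Summarize a Series' dtype and, if categorical, its categories' dtype.

    Hypothetical helper for illustration; not part of evalml.
    """
    info = {"dtype": str(series.dtype), "categories_dtype": None}
    if isinstance(series.dtype, pd.CategoricalDtype):
        info["categories_dtype"] = str(series.cat.categories.dtype)
    return info


# A categorical column whose categories are floats (sample data, assumed).
float_cat = pd.Series([1.0, 2.0, 1.0], dtype="category")

results = {
    # Round trip through strings, then back to the category dtype.
    "astype(str).astype('category')": describe_conversion(
        float_cat.astype(str).astype("category")
    ),
    # Element-wise conversion with apply.
    "apply(str)": describe_conversion(float_cat.apply(str)),
}

for name, info in results.items():
    print(name, info)
```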

tamargrey commented May 2, 2023

Catboost 1.2.0 was just released with fixes to the string categories bug.
