-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
categorical support design proposal #36
base: main
Are you sure you want to change the base?
Changes from 1 commit
79ab183
3f3ce31
29e3c47
f7425f8
ff21347
68c3968
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,143 @@ | ||
# Adding categorical support to sktime | ||
|
||
Contributors: @Abhay-Lejith | ||
|
||
## Introduction | ||
|
||
### Objectives | ||
Allow users to pass data with categorical features to any estimator. If not supported, return an informative error message. Otherwise it should run without errors. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This objective already assumes a design. "if not supported, ..." Try to separate objective from solutioning, please. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I didn't notice, will change it. |
||
|
||
### Current status: | ||
* Non-numeric dtypes are completely blocked off in datatype checks. | ||
* https://github.com/sktime/sktime/pull/5886 - This pr removes these checks but has not been merged yet as most estimators do not support categorical and will break without informative error messages. | ||
* https://github.com/sktime/sktime/pull/6490 - pr adding feature_kind metadata | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would suggest, start with outlining a conceptual model. Explain what kind of (abstract) estimators there are, and what you want to model. Then proceed to example code that should run, try to cover different cases. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wouldn't this 'conceptual model' heavily depend on which design we choose to go ahead with. Or do you want me to do this for all the proposed designs?
I don't think there are any new estimators being made, changes will have to be made to existing classes and again, the change will depend on what design we go for. |
||
|
||
## Description of proposed designs | ||
|
||
|
||
For each design we will have to discuss 8 combinations of cases: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. try to give more context to the reader, what is the context/setting? no.2 in particular may be unclear without |
||
1. estimator natively supports categorical: yes/no | ||
2. categorical feature specified by user: yes/no | ||
3. categorical present in data: yes/no | ||
|
||
## Specifying categorical features: | ||
- The reason I proposed this is in case the user wants to treat a column with numerical values as categorical. This will not be detected by feature_kind as it relies only on the dtype. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "i proposed this" - missing context. Try to separate solutions from requirements, and ensure the document contains context before you reference the context. |
||
- More on how this is tackled in each design below. | ||
--- | ||
--- | ||
|
||
|
||
### Design 1. Allow categorical features ONLY in those estimators which support natively. (no encoding internally) | ||
|
||
- `categorical_features` parameter will be present only in estimators that support natively. | ||
- For estimators that do not support natively, users can make a pipeline with an encoder of their choice and use categorical variables. This is the general ml workflow, like how it's done in sklearn too. | ||
|
||
#### How tag is going to be used | ||
- in datatype checks : | ||
``` | ||
if categorical support tag is false: | ||
if categorical feature is present: | ||
error - cat feature is not supported by this estimator | ||
``` | ||
|
||
- In this case `categorical feature` will not be a parameter in models that do not support. So (NO,YES,_) is not possible. | ||
|
||
| **Estimator supports categorical natively** | **Cat feature specified by user** | **Cat feature present in data** | **Outcome** | | ||
|:-----------------:|:-------------------------:|:--------------:|:---------------:| | ||
| NO | NO | NO | normal case, proceeds as usual without error. | | ||
| NO | NO | YES| informative error message to user that categorical feature is not supported by the estimator | | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thoughts: should it warn and then work by dropping specified column(s)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. that's a different design, there are multiple options - encode, drop, or fail There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I mentioned as "thoughts" to suggest to @Abhay-Lejith to consider this as another possibility (albeit not very desirable) too to do pros and cons. The intention seems to be miscommunicated. |
||
| NO | YES| NO | X | | ||
Abhay-Lejith marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| NO | YES| YES| X | | ||
Abhay-Lejith marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| YES| NO | NO | normal case, proceeds as usual without error. | | ||
| YES| NO | YES| use categorical columns as detected by feature_kind and proceed. | | ||
| YES| YES| NO | informative error message that specified column was not present in data.| | ||
| YES| YES| YES| use categorical columns as detected by feature_kind along with the user specified columns(union) and proceed. | | ||
|
||
- in case where specified column was not found, we can also choose to ignore and continue with warning that column was not found. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Clarification: what do you mean by "continue" here? For example, let's consider ARIMA for revenue forecasting and we have categorical feature of user rating. What will happen? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @yarnabrina This is design 1 which allows categorical only in estimators which support natively. ARIMA does not support natively, so it would just fail saying categorical not supported. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so, if we have a different approach, I would suggest you record it and try to structure the different choices one can make here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @fkiraly , the document has three major headings that say Design 1, Design 2 and Design 3. Is it really not clear? |
||
|
||
#### estimators that can be added: | ||
- xgboost, lightgbm and catboost are some models that have native support for categorical. But method of passing categorical features varies quite a lot between them (I've commented on this in umbrella issue - https://github.com/sktime/sktime/issues/6109). Making a general interface might be challenging. | ||
- So it might be better to interface them in separate classes inheriting from reduction forecaster class. (similar to darts lightgbm, see - https://unit8co.github.io/darts/generated_api/darts.models.forecasting.lgbm.html) | ||
|
||
#### PROS: | ||
- No internal encoding, comparatively easier to implement. | ||
- No major changes to core logic. | ||
- Follows general ML workflow i.e if estimator does not support, then user must preprocess their data(including encoding) and then use it. | ||
#### CONS: | ||
- Less number of estimators support categorical natively. | ||
|
||
|
||
--- | ||
--- | ||
|
||
### Design 2. Perform one hot encoding if not natively supported. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Notes: definitely can be default choice but should be configurable per column and/or overall. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this really required though? Encoders exist for this very purpose and are available to users. Why do we have to provide all that functionality inbuilt? In sklearn, there is nothing of this sort. Users can just preprocess the data and then use it in estimators. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In my opinion, it is really required.
I don't think we should not do a direct comparison with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes of course. I agree that providing all options available is better than a single fixed encoding option. I meant, is the internal encoding necessary . And why? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Precisely, and they have full choice on what columns to encode, what encoder to use and they can do it externally. I don't quite get why we want to make such major changes to sktime only to provide what can already be done. Trying to cram such a large functionality into an inbuilt feature will prove to be very difficult. We would need to allow user to specify which columns to encode, which encoders to use, which encoder to use on which column etc. On the other hand, if we fix a default encoding type, then in most cases the user might have to resort to external encoding anyway. So the internal encoding is not very useful. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Indeed it's a complex change, there is no doubt about that at all. But that can't be the reason to not even consider in planning design phase, complexity of the task is why it was considered for full time summer project instead of open source contributors adding parts of it based on their availability.
I think you did not consider this case, please correct me if I am wrong. I'll give a possible flow below, this is similar to @benHeid 's advanced non-linear pipeline. Suppose a forecasting pipeline has 3 steps: first it predicts few of the future-unknown I am visualising that in this case we need to encode only for the forecaster that does not have native support, not for the other one. If there are more estimators in the pipeline where same variable(s) need to be encoded, reuse same trained encoder there too instead of training again and again. This does not have to be the only way, and there can be better methods, but as of now I am not sure if this can be handled if the encoding of categorical features happen outside of |
||
|
||
- `categorical_feature` param will be present in all models. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thoughts: if this is available as a configuration instead of as an argument, will that be feasible? If so, will it help? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @Abhay-Lejith @fkiraly any thoughts about this comment? |
||
|
||
#### How tag is going to be used: | ||
``` | ||
if cat support tag is false: | ||
encode cat features and continue | ||
else: | ||
continue | ||
``` | ||
|
||
| **Estimator supports categorical natively** | **Cat feature specified by user** | **Cat feature present in data** | **Outcome** | | ||
|:-----------------:|:-------------------------:|:--------------:|:---------------:| | ||
| NO | NO | NO | normal case, proceeds as usual without error. | | ||
| NO | NO | YES| encode features internally as detected by feature_kind and continue. | ||
| NO | YES| NO | error that specified column was not present in data | | ||
| NO | YES| YES| use categorical features detected by feature kind and those specified by user(union) and proceed to encode | | ||
| YES| NO | NO | normal case, proceeds without error.| | ||
| YES| NO | YES| use categorical features detected by feature_kind and proceed| | ||
| YES| YES| NO | error that specified column was not present in data| | ||
| YES| YES| YES| use categorical features detected by feature kind and those specified by user(union) and proceed| | ||
|
||
#### Estimators that will be affected: | ||
- all estimators that take exog input | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Clarification: are we limiting the scope of categorical support to only forecasters and to only There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For now, I think we can start with just X. For y, there will be complications I feel. One major one I can think of immediately is univariate becoming multivariate on one-hot encoding. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think you making leaps of argumentation here, @Abhay-Lejith - you keep presupposing the "one hot encoding" idea, and silently assuming that is the only way to deal with categorical data. What I mean is, when actively asked you are certainly aware that this is not the case, but your "intuitive thought processes" still seem to assume it, as the above shows for me. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @fkiraly I just brought up a certain case where if the user were to one-hot encode y, it could cause issues. I never said it is the only case. Is anything wrong in this? |
||
|
||
#### PROS: | ||
- Most estimators will be able to accept categorical inputs. | ||
- If user only wants to one-hot encode, this design is easier as user does not have to do anything extra.(encoding is done internally) | ||
|
||
#### CONS: | ||
- Will be more complex to implement, may require significant changes to core logic (for internal encoding). | ||
- adding `cat_features` parameter to all estimators is not desired | ||
- Not flexible in choice of encoding (label,one-hot,etc). If user wants different types, will have to resort to externallly doing it anyway. | ||
|
||
--- | ||
--- | ||
|
||
### Design 3. Same as design 2 but NO `cat_features` parameter in model | ||
Instead, user is required to convert the required numerical features to categorical. To ease the process, we can make a `transformer` to make this conversion. | ||
|
||
- no `cat_features` parameter in models. | ||
|
||
| **Estimator supports categorical natively** | **Cat feature present in data** | **Outcome** | | ||
|:-------------------------------------------:|:-------------------------------:|:---------------:| | ||
| NO | NO | normal case, proceeds as usual without error. | | ||
| NO | YES| encode features internally as detected by feature_kind and continue. | ||
| YES| NO | normal case, proceeds without error.| | ||
| YES| YES| use categorical features detected by feature_kind and proceed| | ||
|
||
#### PROS: | ||
- Most estimators will be able to accept categorical inputs. | ||
- If user only wants to one-hot encode, this design is easier as user does not have to do anything extra.(encoding is done internally) | ||
- Removes need for adding extra parameter to all models. | ||
|
||
#### CONS: | ||
- Will be more complex to implement, may require significant changes to core logic (for internal encoding). | ||
- Not flexible in choice of encoding (label,one-hot,etc). If user wants different types, will have to resort to externally doing it anyway. | ||
- Removing parameter comes at cost of requiring user to convert numerical to categorical themselves. | ||
- If this is the case then user can just convert categorical to numerical and use the data. Providing internal encoding is kind of redundant. | ||
|
||
--- | ||
--- | ||
|
||
## Personal preference - | ||
- Design 1: | ||
- I see no need for the encoding to be done internally. Preprocessing data is a regular part of the ml workflow and encoding categorical features is part of it. If users want to use cat features in estimators that do not support natively, then they can encode it themselves externally or in pipelines and then pass it to the estimator. This is how it is everywhere else, including sklearn. | ||
- Additionally, even if we are providing internal encoding, it is not flexible and provides only a single option to the user. If user wants to use different encoding on diff columns, will have to resort to doing it externally anyway. | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my understanding, this is the case:
So, a general question: is design 2 the only design to make use of feature kind PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have clarified about design 3 in #36 (comment). Please check it out.