Feat/lagged features names #1679
…ure_importances in the relevant regression models
…nts names, create generic name for the corresponding variate, updated the tests
Updated the PR:
Each "variate" (target, past_covariates and future_covariates) is processed independently: it's possible to have a mixture of generic names and "original names".
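A minimal sketch of the per-variate behaviour described above, assuming original component names are kept only when every series of a variate agrees on them. The helper `variate_component_names` and the generic `<variate>_<index>` format are hypothetical and are not taken from the PR's code.

```python
from typing import List, Optional, Sequence


def variate_component_names(
    components_per_series: Optional[Sequence[Sequence[str]]],
    variate_type: str,
) -> List[str]:
    """Return the original names if all series agree, generic names otherwise."""
    if not components_per_series:
        # Variate not provided: it contributes no names.
        return []
    first = list(components_per_series[0])
    if all(list(names) == first for names in components_per_series):
        return first
    # Names differ between series: fall back to positional placeholders
    # (the placeholder format is an assumption made for this sketch).
    return [f"{variate_type}_{i}" for i in range(len(first))]


# Each variate is handled on its own, so the results can be mixed.
target_names = variate_component_names([["sales"], ["sales"]], "target")
past_cov_names = variate_component_names([["temp", "rain"], ["t", "r"]], "past_cov")
print(target_names, past_cov_names)
```

Here the target keeps its original name while the past covariates, whose names differ between the two series, receive generic ones.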
Codecov Report
Coverage diff, master vs. #1679:
              master     #1679     +/-
Coverage      94.11%    94.05%    -0.06%
Files            125       125
Lines          11447     11491       +44
Hits           10773     10808       +35
Misses           674       683        +9
... and 8 files with indirect coverage changes
…e explainability module
Updated to use the same naming conventions as the explainability module:
…co/darts into feat/lagged_features_names
Looks good, thanks a lot @madtoinou 🚀
Ready for release! 🥳
darts/utils/data/tabularization.py
Outdated
@@ -527,6 +534,120 @@ def create_lagged_prediction_data(
return X, times


def create_lagged_components_names(
could we handle the static covariates in here as well?
@@ -358,6 +361,32 @@ def _create_lagged_data(

return training_samples, training_labels

def _create_lagged_components_name(
I think we can remove this method and put everything into the helper function create_lagged_component_names
)

# adding the static covariates on the right of each features_cols_name
features_cols_name = self._add_static_covariates_name(
we can move this into the helper function create_lagged_component_names
@@ -445,6 +474,41 @@ def _add_static_covariates(
features = features[0]
return features

def _add_static_covariates_name(
should be part of create_lagged_component_names in my opinion
) -> Union[np.array, Sequence[np.array]]:
"""
Add static covariates names to the features name for RegressionModels.
Accounts for series with potentially different static covariates to accomodate for the maximum
Isn't the number of static covariates guaranteed to be identical? The models should throw an error when using series with different numbers of static covariates, no?
darts/utils/data/tabularization.py
Outdated
@@ -527,6 +534,120 @@ def create_lagged_prediction_data(
return X, times


def create_lagged_components_names(
Suggested change:
- def create_lagged_components_names(
+ def create_lagged_component_names(
past_covariates=past_covariates,
future_covariates=future_covariates,
)
self.model.lagged_features_name_ = lagged_features_names
Let's use the naming convention lagged_feature_names, similar to feature_importances_ from sklearn.
Also, shouldn't we store this in the Darts model, rather than the actual one?
That would also require defining it in the model constructor.
Suggested change:
- self.model.lagged_features_name_ = lagged_features_names
+ self.lagged_feature_names_ = lagged_feature_names
darts/utils/data/tabularization.py
Outdated
target_series = (
    [target_series] if not isinstance(target_series, Sequence) else target_series
)
past_covariates = (
    [past_covariates]
    if not isinstance(past_covariates, Sequence)
    else past_covariates
)
future_covariates = (
    [future_covariates]
    if not isinstance(future_covariates, Sequence)
    else future_covariates
)
Suggested change:
- target_series = (
-     [target_series] if not isinstance(target_series, Sequence) else target_series
- )
- past_covariates = (
-     [past_covariates]
-     if not isinstance(past_covariates, Sequence)
-     else past_covariates
- )
- future_covariates = (
-     [future_covariates]
-     if not isinstance(future_covariates, Sequence)
-     else future_covariates
- )
+ target_series = series2seq(target_series)
+ past_covariates = series2seq(past_covariates)
+ future_covariates = series2seq(future_covariates)
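To illustrate what this suggestion changes, here is a rough standalone sketch of a normalisation helper. `to_sequence` and `FakeSeries` are hypothetical stand-ins written for this example; they are not the Darts `series2seq` implementation, and the assumption that a missing variate (`None`) is passed through unchanged should be checked against the library.

```python
from collections.abc import Sequence


class FakeSeries:
    """Stand-in for a TimeSeries; deliberately not a Sequence."""


def to_sequence(value):
    """Wrap a single series in a list; leave sequences and None untouched."""
    if value is None or isinstance(value, Sequence):
        return value
    return [value]


single = FakeSeries()
wrapped = to_sequence(single)          # single series -> one-element list
already_list = to_sequence([single])   # list is returned as-is
missing = to_sequence(None)            # assumed: None stays None

# The hand-written idiom from the diff wraps None into a one-element list,
# because None is not a Sequence.
idiom_result = [None] if not isinstance(None, Sequence) else None
print(wrapped == [single], already_list == [single], missing, idiom_result)
```

Under these assumptions the one visible difference is the `None` case, which matters for any later code that tests whether a variate was provided.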
darts/utils/data/tabularization.py
Outdated
[lags, lags_past_covariates, lags_future_covariates],
["target", "past_cov", "future_cov"],
):
unique_components_names = set(
we can directly skip this iteration if variates is None (simplifies also later on)
Suggested change:
- unique_components_names = set(
+ if variate is None:
+     continue
+ unique_components_names = set(
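The early `continue` can be sketched in isolation as follows. This is only an illustration: `lagged_names`, the `<component>_<variate>_lag<lag>` format and the lag-major ordering are assumptions made for the example, not the code merged in this PR.

```python
from typing import List, Optional, Sequence


def lagged_names(
    variate_components: Sequence[Optional[Sequence[str]]],
    variate_lags: Sequence[Optional[Sequence[int]]],
) -> List[str]:
    """Build one name per (lag, component) pair, skipping absent variates."""
    names: List[str] = []
    for components, lags, variate_type in zip(
        variate_components,
        variate_lags,
        ["target", "past_cov", "future_cov"],
    ):
        # Skip the whole iteration when the variate or its lags are absent,
        # so the code below never has to handle None.
        if components is None or lags is None:
            continue
        for lag in lags:
            for component in components:
                names.append(f"{component}_{variate_type}_lag{lag}")
    return names


example = lagged_names(
    [["sales"], None, ["holiday"]],
    [[-2, -1], None, [0]],
)
print(example)
```

Skipping early keeps the name-building part free of `None` checks, which is the simplification the comment refers to.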
* feat: create and store the lagged features names in the regression models
* feat: adding corresponding tests in tabularization
* fix: support any kind of Sequence to generate the lagged features name
* feat: verify that the number of lagged feature names matches the feature_importances in the relevant regression models
* fix: if any of the variate is a sequence of ts with different components names, create generic name for the corresponding variate, updated the tests
* fix: using the same naming convention for the lagged components as the explainability module
* refactor and fix some type hint warnings
* simplified lagged feature name generation and moved out of regression model
* fix regr model tests
* fix create lagged data tests
* fix small bug in unit test
* fix bug in unittest from last PR
---------
Co-authored-by: dennisbader <[email protected]>
Fixes #1670.
Summary
Added helper functions in the tabularization module and the regression model to generate labels for the lagged features (and static covariates, if applicable). The labels are stored in the RegressionModel.lagged_features_name_ attribute (List[List[str]]). This enables the use of the feature_importances attribute, available for some sklearn models. That was not possible before because the create_lagged_data method returns arrays and loses the column names.
If the model was fit on a single TimeSeries, the attribute contains only one List[str]; if it was trained on a Sequence of TimeSeries, the attribute contains several List[str].
The lists containing the lagged feature names are always nested; the API could eventually be simplified:
- … TimeSeries used during training, retain the first one only
Additional information
Added the corresponding tests.
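As a rough illustration of what the stored names make possible, the sketch below pairs a list of feature names with a same-length list of importances and ranks them. The helper `rank_lagged_features`, the example names and the importance values are all hypothetical; in practice the importances would come from the fitted sklearn model and the names from the attribute described in the summary.

```python
from typing import List, Sequence, Tuple


def rank_lagged_features(
    names: Sequence[str], importances: Sequence[float]
) -> List[Tuple[str, float]]:
    """Pair each feature name with its importance, most important first."""
    if len(names) != len(importances):
        # Mirrors the consistency check mentioned in the commit list.
        raise ValueError("number of names and importances must match")
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)


# Made-up values, for illustration only.
names = ["sales_target_lag-2", "sales_target_lag-1", "holiday_future_cov_lag0"]
importances = [0.2, 0.7, 0.1]
ranking = rank_lagged_features(names, importances)
for name, score in ranking:
    print(f"{name}: {score}")
```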