Skip to content

Fix mrmr working with categoricals #1311

Merged
merged 4 commits into from
Jul 10, 2023
Merged

Fix mrmr working with categoricals #1311

merged 4 commits into from
Jul 10, 2023

Conversation

alex-hse-repository
Copy link
Collaborator

Before submitting (must do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use Numpy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Proposed Changes

  • Cast all categoricals to floats during redundancy computation in mrmr.

Closing issues

closes #1310

@codecov-commenter
Copy link

codecov-commenter commented Jul 10, 2023

Codecov Report

Merging #1311 (2911e30) into master (e486a24) will decrease coverage by 32.17%.
The diff coverage is 0.00%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@             Coverage Diff             @@
##           master    #1311       +/-   ##
===========================================
- Coverage   88.20%   56.03%   -32.17%     
===========================================
  Files         186      197       +11     
  Lines       11807    12126      +319     
===========================================
- Hits        10414     6795     -3619     
- Misses       1393     5331     +3938     
Impacted Files Coverage Δ
etna/analysis/feature_selection/mrmr_selection.py 26.66% <0.00%> (-73.34%) ⬇️

... and 134 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@github-actions
Copy link

github-actions bot commented Jul 10, 2023

@github-actions github-actions bot temporarily deployed to pull request July 10, 2023 11:49 Inactive
@@ -82,6 +82,14 @@ def mrmr(
redundancy_table = pd.DataFrame(np.inf, index=all_features, columns=all_features)
top_k = min(top_k, len(all_features))

# can't compute correlation of categorical column with the others
cat_cols = regressors.dtypes[regressors.dtypes == "category"].index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't it be easier to cast all regressors to float?

try:
regressors[cat_col] = regressors[cat_col].astype(float)
except ValueError:
raise ValueError(f"{cat_col} column cannot be cast to float type! Please, use encoders.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should test this error.

@github-actions github-actions bot temporarily deployed to pull request July 10, 2023 13:05 Inactive
@@ -32,6 +33,12 @@ def df_with_regressors() -> Dict[str, pd.DataFrame]:
regressor = df_regressors_useless[df_regressors_useless["segment"] == segment]["target"].values
df_exog[f"regressor_useless_{i}"] = regressor

# useless categorical regressor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regressor -> regressors?



@pytest.mark.parametrize("fast_redundancy", [True, False])
def test_mrmr_with_categorical_regressor(df_with_regressors, fast_redundancy):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_mrmr_with_categorical_regressor -> test_mrmr_with_castable_categorical_regressor?

@github-actions github-actions bot temporarily deployed to pull request July 10, 2023 14:01 Inactive
@Mr-Geekman Mr-Geekman merged commit 0225a51 into master Jul 10, 2023
@Mr-Geekman Mr-Geekman deleted the issue-1310 branch July 10, 2023 15:12
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Error on using MRMRFeatureSelectionTransform with fast_redundancy=False
3 participants