Showing 15 changed files with 106 additions and 4 deletions.
@@ -0,0 +1,35 @@
from typing import Union

import pandas as pd
from numpy import log2
from scipy.stats import entropy


def column_imbalance_score(
    value_counts: pd.Series, n_classes: int
) -> Union[float, int]:
    """column_imbalance_score

    The class balance score for categorical and boolean variables uses entropy to calculate a bounded score between 0 and 1.
    A perfectly uniform distribution returns a score of 0, and a perfectly imbalanced distribution returns a score of 1.

    For a distribution over a finite set of values (e.g. categorical), entropy is maximised when the distribution is uniform, i.e. as 'flat' as possible (Jaynes: Probability Theory, The Logic of Science).
    To calculate the class imbalance, we compute the entropy of the distribution and the maximum possible entropy for that number of classes.
    The entropy is computed from the value counts (i.e. the frequency of each class), and the maximum entropy is log2(number of classes).
    We then divide the entropy by the maximum possible entropy to get a value between 0 and 1, and subtract that value from 1.

    Args:
        value_counts (pd.Series): frequency of each category
        n_classes (int): number of classes

    Returns:
        Union[float, int]: float or integer bounded between 0 and 1 inclusively
    """
    # return 0 if there is only one class (entropy = 0), since a single class is trivially balanced.
    # note that this also prevents a zero division error with log2(n_classes)
    if n_classes > 1:
        # casting to numpy array to ensure correct dtype when a categorical integer
        # variable is evaluated
        value_counts = value_counts.to_numpy(dtype=float)
        return 1 - (entropy(value_counts, base=2) / log2(n_classes))
    return 0
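The score described in the docstring is 1 - H(p) / log2(k), where H(p) is the base-2 entropy of the class frequencies and k is the number of classes. The following is a minimal stand-alone sketch of that formula using only the standard library, so the arithmetic can be checked by hand. The name `imbalance_score_sketch` is hypothetical and not part of this commit; the committed function relies on `scipy.stats.entropy` instead.

```python
import math
from typing import Sequence


def imbalance_score_sketch(counts: Sequence[float]) -> float:
    """Hypothetical re-statement of the score: 1 - H(p) / log2(k)."""
    n_classes = len(counts)
    # A single class is treated as balanced; this also avoids dividing by log2(1) == 0.
    if n_classes <= 1:
        return 0.0
    total = sum(counts)
    # Classes with zero count contribute nothing to the entropy (0 * log 0 is taken as 0).
    probs = [c / total for c in counts if c > 0]
    observed_entropy = -sum(p * math.log2(p) for p in probs)
    max_entropy = math.log2(n_classes)
    return 1 - observed_entropy / max_entropy


# Frequencies 10/20/60/10 give an entropy of about 1.571 bits out of a maximum of 2 bits.
print(round(imbalance_score_sketch([10, 20, 60, 10]), 2))  # 0.21
# All observations in one of two classes: entropy is 0, so the score is 1.
print(imbalance_score_sketch([30, 0]))  # 1.0
```

The sketch assumes at least one non-zero count; an all-zero input would need separate handling.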
1 change: 1 addition & 0 deletions
src/pandas_profiling/report/presentation/flavours/html/templates/alerts/alert_imbalance.html
@@ -0,0 +1 @@
<a href="#pp_var_{{ alert.anchor_id }}"><code>{{ alert.column_name }}</code></a> is highly imbalanced ({{ alert.values['imbalance'] | fmt_percent }})
@@ -0,0 +1,18 @@
import pandas as pd

from pandas_profiling.model.pandas.imbalance_pandas import column_imbalance_score


def test_column_imbalance_score_many_classes():
    value_counts = pd.Series([10, 20, 60, 10])
    assert column_imbalance_score(value_counts, len(value_counts)).round(2) == 0.21


def test_column_imbalance_score_uniform_distribution():
    value_counts = pd.Series([10, 10, 10, 10, 10])
    assert column_imbalance_score(value_counts, len(value_counts)).round(2) == 0


def test_column_imbalance_score_one_class():
    value_counts = pd.Series([30])
    assert column_imbalance_score(value_counts, len(value_counts)) == 0