diff --git a/notebooks/linear_models_sol_03.ipynb b/notebooks/linear_models_sol_03.ipynb index 0eabeeb54..178514087 100644 --- a/notebooks/linear_models_sol_03.ipynb +++ b/notebooks/linear_models_sol_03.ipynb @@ -2,18 +2,25 @@ "cells": [ { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "lines_to_next_cell": 2 + }, "source": [ "# \ud83d\udcc3 Solution for Exercise M4.03\n", "\n", - "The parameter `penalty` can control the **type** of regularization to use,\n", - "whereas the regularization **strength** is set using the parameter `C`.\n", - "Setting`penalty=\"none\"` is equivalent to an infinitely large value of `C`. In\n", - "this exercise, we ask you to train a logistic regression classifier using the\n", - "`penalty=\"l2\"` regularization (which happens to be the default in\n", - "scikit-learn) to find by yourself the effect of the parameter `C`.\n", + "In the previous Module we tuned the hyperparameter `C` of the logistic\n", + "regression without mentioning that it controls the regularization strength.\n", + "Later, on the slides on \ud83c\udfa5 **Intuitions on regularized linear models** we\n", + "mentioned that a small `C` provides a more regularized model, whereas a\n", + "non-regularized model is obtained with an infinitely large value of `C`.\n", + "Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`\n", + "model.\n", + "\n", + "In this exercise, we ask you to train a logistic regression classifier using\n", + "different values of the parameter `C` to find its effects by yourself.\n", "\n", - "We start by loading the dataset." + "We start by loading the dataset. We only keep the Adelie and Chinstrap classes\n", + "to keep the discussion simple." ] }, { @@ -36,7 +43,6 @@ "import pandas as pd\n", "\n", "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n", - "# only keep the Adelie and Chinstrap classes\n", "penguins = (\n", " penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n", ")\n", @@ -53,7 +59,9 @@ "source": [ "from sklearn.model_selection import train_test_split\n", "\n", - "penguins_train, penguins_test = train_test_split(penguins, random_state=0)\n", + "penguins_train, penguins_test = train_test_split(\n", + " penguins, random_state=0, test_size=0.4\n", + ")\n", "\n", "data_train = penguins_train[culmen_columns]\n", "data_test = penguins_test[culmen_columns]\n", @@ -66,7 +74,67 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "First, let's create our predictive model." + "We define a function to help us fit a given `model` and plot its decision\n", + "boundary. We recall that by using a `DecisionBoundaryDisplay` with a diverging\n", + "colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped\n", + "to the white color. Equivalently, the darker the color, the closer the\n", + "predicted probability is to 0 or 1 and the more confident the classifier is in\n", + "its predictions."
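Editor's aside (not part of the patch): the claim above that `C` behaves as the inverse of `alpha` can be made concrete with the objective documented for scikit-learn's l2-penalized `LogisticRegression`, written here up to notational details with the labels `y_i` encoded as ±1; dividing the whole expression by `C` shows that `1/C` plays the same role as `alpha` in `Ridge`:

```latex
\min_{w,\, b} \;\; \frac{1}{2}\,\lVert w \rVert_2^2
  \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\bigl(-y_i\,(x_i^\top w + b)\bigr)\right)
```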
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.inspection import DecisionBoundaryDisplay\n", + "\n", + "\n", + "def plot_decision_boundary(model):\n", + " model.fit(data_train, target_train)\n", + " accuracy = model.score(data_test, target_test)\n", + "\n", + " disp = DecisionBoundaryDisplay.from_estimator(\n", + " model,\n", + " data_train,\n", + " response_method=\"predict_proba\",\n", + " plot_method=\"pcolormesh\",\n", + " cmap=\"RdBu_r\",\n", + " alpha=0.8,\n", + " vmin=0.0,\n", + " vmax=1.0,\n", + " )\n", + " DecisionBoundaryDisplay.from_estimator(\n", + " model,\n", + " data_train,\n", + " response_method=\"predict_proba\",\n", + " plot_method=\"contour\",\n", + " linestyles=\"--\",\n", + " linewidths=1,\n", + " alpha=0.8,\n", + " levels=[0.5],\n", + " ax=disp.ax_,\n", + " )\n", + " sns.scatterplot(\n", + " data=penguins_train,\n", + " x=culmen_columns[0],\n", + " y=culmen_columns[1],\n", + " hue=target_column,\n", + " palette=[\"tab:blue\", \"tab:red\"],\n", + " ax=disp.ax_,\n", + " )\n", + " plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n", + " plt.title(f\"C: {C} \\n Accuracy on the test set: {accuracy:.2f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's now create our predictive model." ] }, { @@ -79,19 +147,24 @@ "from sklearn.preprocessing import StandardScaler\n", "from sklearn.linear_model import LogisticRegression\n", "\n", - "logistic_regression = make_pipeline(\n", - " StandardScaler(), LogisticRegression(penalty=\"l2\")\n", - ")" + "logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Given the following candidates for the `C` parameter, find out the impact of\n", - "`C` on the classifier decision boundary. You can use\n", - "`sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the\n", - "decision function boundary." + "## Influence of the parameter `C` on the decision boundary\n", + "\n", + "Given the following candidates for the `C` parameter and the\n", + "`plot_decision_boundary` function, find out the impact of `C` on the\n", + "classifier's decision boundary.\n", + "\n", + "- How does the value of `C` impact the confidence on the predictions?\n", + "- How does it impact the underfit/overfit trade-off?\n", + "- How does it impact the position and orientation of the decision boundary?\n", + "\n", + "Try to give an interpretation on the reason for such behavior." 
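Editor's aside (not required by the exercise): before looking at the solution, the predicted probabilities of the fitted pipeline can be recomputed by hand from the learned weights, which helps connect `C` to the steepness of the sigmoid discussed in the solution. This is only a sketch; it assumes the `logistic_regression` pipeline and the train/test split defined in the cells above.

```python
# Sketch: recompute predict_proba by hand as sigmoid(w . x_scaled + b),
# reusing the `logistic_regression` pipeline and data splits defined above.
import numpy as np

logistic_regression.set_params(logisticregression__C=1.0)
logistic_regression.fit(data_train, target_train)

scaler = logistic_regression[0]  # StandardScaler step of the pipeline
classifier_step = logistic_regression[-1]  # LogisticRegression step

X_scaled = scaler.transform(data_test)
decision = X_scaled @ classifier_step.coef_[0] + classifier_step.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-decision))

# Should match the probability of the second class (classes_[1]).
print(np.allclose(manual_proba, logistic_regression.predict_proba(data_test)[:, 1]))
```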
] }, { @@ -100,41 +173,75 @@ "metadata": {}, "outputs": [], "source": [ - "Cs = [0.01, 0.1, 1, 10]\n", + "Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]\n", "\n", "# solution\n", - "import matplotlib.pyplot as plt\n", - "import seaborn as sns\n", - "from sklearn.inspection import DecisionBoundaryDisplay\n", - "\n", "for C in Cs:\n", " logistic_regression.set_params(logisticregression__C=C)\n", - " logistic_regression.fit(data_train, target_train)\n", - " accuracy = logistic_regression.score(data_test, target_test)\n", + " plot_decision_boundary(logistic_regression)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "solution" + ] + }, + "source": [ "\n", - " DecisionBoundaryDisplay.from_estimator(\n", - " logistic_regression,\n", - " data_test,\n", - " response_method=\"predict\",\n", - " cmap=\"RdBu_r\",\n", - " alpha=0.5,\n", - " )\n", - " sns.scatterplot(\n", - " data=penguins_test,\n", - " x=culmen_columns[0],\n", - " y=culmen_columns[1],\n", - " hue=target_column,\n", - " palette=[\"tab:red\", \"tab:blue\"],\n", - " )\n", - " plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n", - " plt.title(f\"C: {C} \\n Accuracy on the test set: {accuracy:.2f}\")" + "On this series of plots we can observe several important points. Regarding the\n", + "confidence on the predictions:\n", + "\n", + "- For low values of `C` (strong regularization), the classifier is less\n", + " confident in its predictions. We are enforcing a **spread sigmoid**.\n", + "- For high values of `C` (weak regularization), the classifier is more\n", + " confident: the areas with dark blue (very confident in predicting \"Adelie\")\n", + " and dark red (very confident in predicting \"Chinstrap\") nearly cover the\n", + " entire feature space. We are enforcing a **steep sigmoid**.\n", + "\n", + "To answer the next question, think that misclassified data points are more\n", + "costly when the classifier is more confident on the decision. Decision rules\n", + "are mostly driven by avoiding such cost. From the previous observations we can\n", + "then deduce that:\n", + "\n", + "- The smaller the `C` (the stronger the regularization), the lower the cost\n", + " of a misclassification. As more data points lay in the low-confidence\n", + " zone, the more the decision rules are influenced almost uniformly by all\n", + " the data points. This leads to a less expressive model, which may underfit.\n", + "- The higher the value of `C` (the weaker the regularization), the more the\n", + " decision is influenced by a few training points very close to the boundary,\n", + " where decisions are costly. Remember that models may overfit if the number\n", + " of samples in the training set is too small, as at least a minimum of\n", + " samples is needed to average the noise out.\n", + "\n", + "The orientation is the result of two factors: minimizing the number of\n", + "misclassified training points with high confidence and their distance to the\n", + "decision boundary (notice how the contour line tries to align with the most\n", + "misclassified data points in the dark-colored zone). 
This is closely related\n", + "to the value of the weights of the model, which is explained in the next part\n", + "of the exercise.\n", + "\n", + "Finally, for small values of `C` the position of the decision boundary is\n", + "affected by the class imbalance: when `C` is near zero, the model predicts the\n", + "majority class (as seen in the training set) everywhere in the feature space.\n", + "In our case, there are approximately two times more \"Adelie\" than \"Chinstrap\"\n", + "penguins. This explains why the decision boundary is shifted to the right when\n", + "`C` gets smaller. Indeed, the most regularized model predicts light blue\n", + "almost everywhere in the feature space." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Look at the impact of the `C` hyperparameter on the magnitude of the weights." + "## Impact of the regularization on the weights\n", + "\n", + "Look at the impact of the `C` hyperparameter on the magnitude of the weights.\n", + "**Hint**: You can [access pipeline\n", + "steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)\n", + "by name or position. Then you can query the attributes of that step such as\n", + "`coef_`." ] }, { @@ -144,12 +251,12 @@ "outputs": [], "source": [ "# solution\n", - "weights_ridge = []\n", + "lr_weights = []\n", "for C in Cs:\n", " logistic_regression.set_params(logisticregression__C=C)\n", " logistic_regression.fit(data_train, target_train)\n", " coefs = logistic_regression[-1].coef_[0]\n", - " weights_ridge.append(pd.Series(coefs, index=culmen_columns))" + " lr_weights.append(pd.Series(coefs, index=culmen_columns))" ] }, { @@ -162,8 +269,8 @@ }, "outputs": [], "source": [ - "weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f\"C: {C}\" for C in Cs])\n", - "weights_ridge.plot.barh()\n", + "lr_weights = pd.concat(lr_weights, axis=1, keys=[f\"C: {C}\" for C in Cs])\n", + "lr_weights.plot.barh()\n", "_ = plt.title(\"LogisticRegression weights depending of C\")" ] }, @@ -175,14 +282,101 @@ ] }, "source": [ - "We see that a small `C` will shrink the weights values toward zero. It means\n", - "that a small `C` provides a more regularized model. Thus, `C` is the inverse\n", - "of the `alpha` coefficient in the `Ridge` model.\n", "\n", - "Besides, with a strong penalty (i.e. small `C` value), the weight of the\n", - "feature \"Culmen Depth (mm)\" is almost zero. It explains why the decision\n", + "As a small `C` provides a more regularized model, it shrinks the weight values\n", + "toward zero, as in the `Ridge` model.\n", + "\n", + "In particular, with a strong penalty (e.g. `C = 0.01`), the weight of the feature\n", + "named \"Culmen Depth (mm)\" is almost zero. It explains why the decision\n", "separation in the plot is almost perpendicular to the \"Culmen Length (mm)\"\n", - "feature." + "feature.\n", + "\n", + "For even stronger penalty strengths (e.g. `C = 1e-6`), the weights of both\n", + "features are almost zero. It explains why the decision separation in the plot\n", + "is almost constant in the feature space: the predicted probability is only\n", + "based on the intercept parameter of the model (which is never regularized)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "## Impact of the regularization with non-linear feature engineering\n", + "\n", + "Use the `plot_decision_boundary` function to repeat the experiment using a\n", + "non-linear feature engineering pipeline. For this purpose, insert\n", + "`Nystroem(kernel=\"rbf\", gamma=1, n_components=100)` between the\n", + "`StandardScaler` and the `LogisticRegression` steps.\n", + "\n", + "- Does the value of `C` still impact the position of the decision boundary and\n", + " the confidence of the model?\n", + "- What can you say about the impact of `C` on the underfitting vs overfitting\n", + " trade-off?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "from sklearn.kernel_approximation import Nystroem\n", + "\n", + "# solution\n", + "classifier = make_pipeline(\n", + " StandardScaler(),\n", + " Nystroem(kernel=\"rbf\", gamma=1.0, n_components=100, random_state=0),\n", + " LogisticRegression(penalty=\"l2\", max_iter=1000),\n", + ")\n", + "\n", + "for C in Cs:\n", + " classifier.set_params(logisticregression__C=C)\n", + " plot_decision_boundary(classifier)" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "\n", + "- For the lowest values of `C`, the overall pipeline underfits: it predicts\n", + " the majority class everywhere, as previously.\n", + "- When `C` increases, the model starts to predict some data points from the\n", + " \"Chinstrap\" class but the model is not very confident anywhere in the\n", + " feature space.\n", + "- The decision boundary is no longer a straight line: the linear model is now\n", + " classifying in the 100-dimensional feature space created by the `Nystroem`\n", + " transformer. As a result, the decision boundary induced by the overall\n", + " pipeline is now expressive enough to wrap around the minority class.\n", + "- For `C = 1` in particular, it finds a smooth red blob around most of the\n", + " \"Chinstrap\" data points. When moving away from the data points, the model is\n", + " less confident in its predictions and again tends to predict the majority\n", + " class according to the proportion in the training set.\n", + "- For higher values of `C`, the model starts to overfit: it is very confident\n", + " in its predictions almost everywhere, but it should not be trusted: the\n", + " model also makes a larger number of mistakes on the test set (not shown in\n", + " the plot) while adopting a very curvy decision boundary to attempt fitting\n", + " all the training points, including the noisy ones at the frontier between\n", + " the two classes. This makes the decision boundary very sensitive to the\n", + " sampling of the training set and as a result, it does not generalize well in\n", + " that region. This is confirmed by the (slightly) lower accuracy on the test\n", + " set.\n", + "\n", + "Finally, we can also note that the linear model on the raw features was as\n", + "good or better than the best model using non-linear feature engineering. So in\n", + "this case, we did not really need this extra complexity in our pipeline.\n", + "**Simpler is better!**\n", + "\n", + "So to conclude, when using non-linear feature engineering, it is often\n", + "possible to make the pipeline overfit, even if the original feature space is\n", + "low-dimensional. As a result, it is important to tune the regularization\n", + "parameter in conjunction with the parameters of the transformers (e.g. tuning\n", + "`gamma` would be important here). This has a direct impact on the certainty of\n", + "the predictions."
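Editor's aside (not part of the original solution): the last point about tuning the regularization together with the transformer parameters could look as sketched below, reusing the `classifier` pipeline and the train/test split defined above. The step names (`nystroem`, `logisticregression`) follow the `make_pipeline` convention of lower-cased class names.

```python
# Sketch: tune C jointly with the Nystroem gamma on the training set.
from sklearn.model_selection import GridSearchCV

param_grid = {
    "nystroem__gamma": [0.01, 0.1, 1.0, 10.0],
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0, 100.0],
}
search = GridSearchCV(classifier, param_grid=param_grid, cv=5, n_jobs=2)
search.fit(data_train, target_train)

print("Best parameters:", search.best_params_)
print(f"Test accuracy: {search.score(data_test, target_test):.3f}")
```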
] } ], diff --git a/notebooks/parameter_tuning_ex_02.ipynb b/notebooks/parameter_tuning_ex_02.ipynb index 46345e86b..2aa096d5c 100644 --- a/notebooks/parameter_tuning_ex_02.ipynb +++ b/notebooks/parameter_tuning_ex_02.ipynb @@ -57,7 +57,6 @@ " )\n", " ],\n", " remainder=\"passthrough\",\n", - " sparse_threshold=0,\n", ")\n", "\n", "from sklearn.ensemble import HistGradientBoostingClassifier\n", diff --git a/notebooks/parameter_tuning_grid_search.ipynb b/notebooks/parameter_tuning_grid_search.ipynb index d26aff083..e0912cb54 100644 --- a/notebooks/parameter_tuning_grid_search.ipynb +++ b/notebooks/parameter_tuning_grid_search.ipynb @@ -157,7 +157,6 @@ "preprocessor = ColumnTransformer(\n", " [(\"cat_preprocessor\", categorical_preprocessor, categorical_columns)],\n", " remainder=\"passthrough\",\n", - " sparse_threshold=0,\n", ")" ] }, diff --git a/notebooks/parameter_tuning_nested.ipynb b/notebooks/parameter_tuning_nested.ipynb index b7c14a3bf..efc43173d 100644 --- a/notebooks/parameter_tuning_nested.ipynb +++ b/notebooks/parameter_tuning_nested.ipynb @@ -70,7 +70,6 @@ " (\"cat_preprocessor\", categorical_preprocessor, categorical_columns),\n", " ],\n", " remainder=\"passthrough\",\n", - " sparse_threshold=0,\n", ")" ] }, diff --git a/notebooks/parameter_tuning_randomized_search.ipynb b/notebooks/parameter_tuning_randomized_search.ipynb index 11bfac389..3189e9301 100644 --- a/notebooks/parameter_tuning_randomized_search.ipynb +++ b/notebooks/parameter_tuning_randomized_search.ipynb @@ -121,7 +121,6 @@ "preprocessor = ColumnTransformer(\n", " [(\"cat_preprocessor\", categorical_preprocessor, categorical_columns)],\n", " remainder=\"passthrough\",\n", - " sparse_threshold=0,\n", ")" ] }, diff --git a/notebooks/parameter_tuning_sol_02.ipynb b/notebooks/parameter_tuning_sol_02.ipynb index bbcb42f88..58ef6a501 100644 --- a/notebooks/parameter_tuning_sol_02.ipynb +++ b/notebooks/parameter_tuning_sol_02.ipynb @@ -57,7 +57,6 @@ " )\n", " ],\n", " remainder=\"passthrough\",\n", - " sparse_threshold=0,\n", ")\n", "\n", "from sklearn.ensemble import HistGradientBoostingClassifier\n", diff --git a/notebooks/trees_classification.ipynb b/notebooks/trees_classification.ipynb index dfcae831c..22eae1fca 100644 --- a/notebooks/trees_classification.ipynb +++ b/notebooks/trees_classification.ipynb @@ -6,8 +6,11 @@ "source": [ "# Build a classification decision tree\n", "\n", - "We will illustrate how decision tree fit data with a simple classification\n", - "problem using the penguins dataset." + "In this notebook we illustrate decision trees in a multiclass classification\n", + "problem by using the penguins dataset with 2 features and 3 classes.\n", + "\n", + "For the sake of simplicity, we focus the discussion on the hyperparameter\n", + "`max_depth`, which controls the maximal depth of the decision tree." ] }, { @@ -36,7 +41,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Besides, we split the data into two subsets to investigate how trees will\n", - "predict values based on an out-of-samples dataset." + "First, we split the data into two subsets to investigate how trees predict\n", + "values based on unseen data." ] }, { @@ -60,16 +63,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In a previous notebook, we learnt that a linear classifier will define a\n", - "linear separation to split classes using a linear combination of the input\n", - "features. 
In our 2-dimensional space, it means that a linear classifier will\n", - "define some oblique lines that best separate our classes. We define a function\n", - "below that, given a set of data points and a classifier, will plot the\n", - "decision boundaries learnt by the classifier.\n", - "\n", - "Thus, for a linear classifier, we will obtain the following decision\n", - "boundaries. These boundaries lines indicate where the model changes its\n", - "prediction from one class to another." + "In a previous notebook, we learnt that linear classifiers define a linear\n", + "separation to split classes using a linear combination of the input features.\n", + "In our 2-dimensional feature space, it means that a linear classifier finds\n", + "the oblique lines that best separate the classes. This is still true for\n", + "multiclass problems, except that more than one line is fitted. We can use\n", + "`DecisionBoundaryDisplay` to plot the decision boundaries learnt by the\n", + "classifier." ] }, { @@ -91,15 +91,22 @@ "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", + "import matplotlib as mpl\n", "import seaborn as sns\n", "\n", "from sklearn.inspection import DecisionBoundaryDisplay\n", "\n", + "tab10_norm = mpl.colors.Normalize(vmin=-0.5, vmax=8.5)\n", "# create a palette to be used in the scatterplot\n", - "palette = [\"tab:red\", \"tab:blue\", \"black\"]\n", + "palette = [\"tab:blue\", \"tab:green\", \"tab:orange\"]\n", "\n", - "DecisionBoundaryDisplay.from_estimator(\n", - " linear_model, data_train, response_method=\"predict\", cmap=\"RdBu\", alpha=0.5\n", + "dbd = DecisionBoundaryDisplay.from_estimator(\n", + " linear_model,\n", + " data_train,\n", + " response_method=\"predict\",\n", + " cmap=\"tab10\",\n", + " norm=tab10_norm,\n", + " alpha=0.5,\n", ")\n", "sns.scatterplot(\n", " data=penguins,\n", @@ -119,7 +126,7 @@ "source": [ "We see that the lines are a combination of the input features since they are\n", "not perpendicular a specific axis. Indeed, this is due to the model\n", - "parametrization that we saw in the previous notebook, controlled by the\n", + "parametrization that we saw in some previous notebooks, i.e. controlled by the\n", "model's weights and intercept.\n", "\n", "Besides, it seems that the linear model would be a good candidate for such\n", @@ -141,13 +148,27 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Unlike linear models, decision trees are non-parametric models: they are not\n", - "controlled by a mathematical decision function and do not have weights or\n", - "intercept to be optimized.\n", + "Unlike linear models, the decision rule for the decision tree is not\n", + "controlled by a simple linear combination of weights and feature values.\n", + "\n", + "Instead, the decision rules of trees can be defined in terms of\n", + "- the feature index used at each split node of the tree,\n", + "- the threshold value used at each split node,\n", + "- the value to predict at each leaf node.\n", + "\n", + "Decision trees partition the feature space by considering a single feature at\n", + "a time. The number of splits depends on both the hyperparameters and the\n", + "number of data points in the training set: the more flexible the\n", + "hyperparameters and the larger the training set, the more splits can be\n", + "considered by the model.\n", "\n", - "Indeed, decision trees will partition the space by considering a single\n", - "feature at a time. 
Let's illustrate this behaviour by having a decision tree\n", - "make a single split to partition the feature space." + "As the number of adjustable components taking part in the decision rule\n", + "changes with the training size, we say that decision trees are non-parametric\n", + "models.\n", + "\n", + "Let's now visualize the shape of the decision boundary of a decision tree when\n", + "we set the `max_depth` hyperparameter to only allow for a single split to\n", + "partition the feature space." ] }, { @@ -169,7 +190,12 @@ "outputs": [], "source": [ "DecisionBoundaryDisplay.from_estimator(\n", - " tree, data_train, response_method=\"predict\", cmap=\"RdBu\", alpha=0.5\n", + " tree,\n", + " data_train,\n", + " response_method=\"predict\",\n", + " cmap=\"tab10\",\n", + " norm=tab10_norm,\n", + " alpha=0.5,\n", ")\n", "sns.scatterplot(\n", " data=penguins,\n", @@ -188,8 +214,8 @@ "source": [ "The partitions found by the algorithm separates the data along the axis\n", "\"Culmen Depth\", discarding the feature \"Culmen Length\". Thus, it highlights\n", - "that a decision tree does not use a combination of feature when making a\n", - "split. We can look more in depth at the tree structure." + "that a decision tree does not use a combination of features when making a\n", + "single split. We can look more in depth at the tree structure." ] }, { @@ -230,16 +256,16 @@ "dataset was subdivided into 2 sets based on the culmen depth (inferior or\n", "superior to 16.45 mm).\n", "\n", - "This partition of the dataset minimizes the class diversities in each\n", + "This partition of the dataset minimizes the class diversity in each\n", "sub-partitions. This measure is also known as a **criterion**, and is a\n", "settable parameter.\n", "\n", "If we look more closely at the partition, we see that the sample superior to\n", - "16.45 belongs mainly to the Adelie class. Looking at the values, we indeed\n", - "observe 103 Adelie individuals in this space. We also count 52 Chinstrap\n", - "samples and 6 Gentoo samples. We can make similar interpretation for the\n", + "16.45 belongs mainly to the \"Adelie\" class. Looking at the values, we indeed\n", + "observe 103 \"Adelie\" individuals in this space. We also count 52 \"Chinstrap\"\n", + "samples and 6 \"Gentoo\" samples. We can make similar interpretation for the\n", "partition defined by a threshold inferior to 16.45mm. In this case, the most\n", - "represented class is the Gentoo species.\n", + "represented class is the \"Gentoo\" species.\n", "\n", "Let's see how our tree would work as a predictor. Let's start with a case\n", "where the culmen depth is inferior to the threshold." @@ -251,15 +277,17 @@ "metadata": {}, "outputs": [], "source": [ - "sample_1 = pd.DataFrame({\"Culmen Length (mm)\": [0], \"Culmen Depth (mm)\": [15]})\n", - "tree.predict(sample_1)" + "test_penguin_1 = pd.DataFrame(\n", + " {\"Culmen Length (mm)\": [0], \"Culmen Depth (mm)\": [15]}\n", + ")\n", + "tree.predict(test_penguin_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The class predicted is the Gentoo. We can now check what happens if we pass a\n", + "The class predicted is the \"Gentoo\". We can now check what happens if we pass a\n", "culmen depth superior to the threshold." 
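Editor's aside (optional, not part of the patch): before checking that second case, the same fitted split rule, its 16.45 mm threshold and the class counts in each partition can also be printed in text form. This is only an illustration and assumes the `tree` estimator and the `culmen_columns` list defined earlier in the notebook.

```python
# Print the fitted split rule (feature, threshold, class counts) as text,
# reusing `tree` and `culmen_columns` from the cells above.
from sklearn.tree import export_text

print(export_text(tree, feature_names=culmen_columns, show_weights=True))
```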
] }, { @@ -269,17 +297,19 @@ "metadata": {}, "outputs": [], "source": [ - "sample_2 = pd.DataFrame({\"Culmen Length (mm)\": [0], \"Culmen Depth (mm)\": [17]})\n", - "tree.predict(sample_2)" + "test_penguin_2 = pd.DataFrame(\n", + " {\"Culmen Length (mm)\": [0], \"Culmen Depth (mm)\": [17]}\n", + ")\n", + "tree.predict(test_penguin_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "In this case, the tree predicts the Adelie specie.\n", + "In this case, the tree predicts the \"Adelie\" species.\n", "\n", - "Thus, we can conclude that a decision tree classifier will predict the most\n", + "Thus, we can conclude that a decision tree classifier predicts the most\n", "represented class within a partition.\n", "\n", "During the training, we have a count of samples in each partition, we can also\n", @@ -293,7 +323,7 @@ "metadata": {}, "outputs": [], "source": [ - "y_pred_proba = tree.predict_proba(sample_2)\n", + "y_pred_proba = tree.predict_proba(test_penguin_2)\n", "y_proba_class_0 = pd.Series(y_pred_proba[0], index=tree.classes_)" ] }, @@ -338,8 +368,8 @@ "metadata": {}, "source": [ "It is also important to note that the culmen length has been disregarded for\n", - "the moment. It means that whatever the value given, it will not be used during\n", - "the prediction." + "the moment. It means that regardless of its value, it is not used during the\n", + "prediction." ] }, { @@ -348,10 +378,10 @@ "metadata": {}, "outputs": [], "source": [ - "sample_3 = pd.DataFrame(\n", + "test_penguin_3 = pd.DataFrame(\n", " {\"Culmen Length (mm)\": [10_000], \"Culmen Depth (mm)\": [17]}\n", ")\n", - "tree.predict_proba(sample_3)" + "tree.predict_proba(test_penguin_3)" ] }, { @@ -378,12 +408,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Indeed, it is not a surprise. We saw earlier that a single feature will not be\n", - "able to separate all three species. However, from the previous analysis we saw\n", - "that by using both features we should be able to get fairly good results.\n", + "Indeed, it is not a surprise. We saw earlier that a single feature is not able\n", + "to separate all three species: it underfits. However, from the previous\n", + "analysis we saw that by using both features we should be able to get fairly\n", + "good results.\n", "\n", - "In the next exercise, you will increase the size of the tree depth. You will\n", - "get intuitions on how the space partitioning is repeated over time." + "In the next exercise, you will increase the tree depth to get an intuition on\n", + "how such a parameter affects the space partitioning." ] } ], diff --git a/notebooks/trees_sol_01.ipynb b/notebooks/trees_sol_01.ipynb index c126f23fa..2ce0c1b8b 100644 --- a/notebooks/trees_sol_01.ipynb +++ b/notebooks/trees_sol_01.ipynb @@ -6,16 +6,13 @@ "source": [ "# \ud83d\udcc3 Solution for Exercise M5.01\n", "\n", - "In the previous notebook, we showed how a tree with a depth of 1 level was\n", - "working. The aim of this exercise is to repeat part of the previous experiment\n", - "for a depth with 2 levels to show how the process of partitioning is repeated\n", - "over time.\n", + "In the previous notebook, we showed how a tree with a depth of 1 level works.\n", + "The aim of this exercise is to repeat part of the previous experiment for a\n", + "tree with a depth of 2 levels to show how such a parameter affects the feature\n", + "space partitioning.\n", - "\n", - "Before to start, we will:\n", - "\n", - "* load the dataset;\n", - "* split the dataset into training and testing dataset;\n", - "* define the function to show the classification decision function." + "\n", + "We first load the penguins dataset and split it into training and testing\n", + "sets:" ] }, { @@ -61,10 +58,7 @@ "metadata": {}, "source": [ "Create a decision tree classifier with a maximum depth of 2 levels and fit the\n", - "training data. Once this classifier trained, plot the data and the decision\n", - "boundary to see the benefit of increasing the depth. To plot the decision\n", - "boundary, you should import the class `DecisionBoundaryDisplay` from the\n", - "module `sklearn.inspection` as shown in the previous course notebook." + "training data." ] }, { @@ -80,24 +74,49 @@ "tree.fit(data_train, target_train)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now plot the data and the decision boundary of the trained classifier to see\n", + "the effect of increasing the depth of the tree.\n", + "\n", + "Hint: Use the class `DecisionBoundaryDisplay` from the module\n", + "`sklearn.inspection` as shown in previous course notebooks.\n", + "\n", + "
<div class=\"admonition warning alert alert-danger\">\n", + "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Warning</p>\n", + "<p class=\"last\">At this time, it is not possible to use response_method=\"predict_proba\" for\n", + "multiclass problems. This is a planned feature for a future version of\n", + "scikit-learn. In the meantime, you can use response_method=\"predict\"\n", + "instead.</p>\n", + "</div>" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "tags": [ - "solution" - ] - }, + "metadata": {}, "outputs": [], "source": [ + "# solution\n", "import matplotlib.pyplot as plt\n", + "import matplotlib as mpl\n", "import seaborn as sns\n", "\n", "from sklearn.inspection import DecisionBoundaryDisplay\n", "\n", - "palette = [\"tab:red\", \"tab:blue\", \"black\"]\n", + "\n", + "tab10_norm = mpl.colors.Normalize(vmin=-0.5, vmax=8.5)\n", + "\n", + "palette = [\"tab:blue\", \"tab:green\", \"tab:orange\"]\n", "DecisionBoundaryDisplay.from_estimator(\n", - " tree, data_train, response_method=\"predict\", cmap=\"RdBu\", alpha=0.5\n", + " tree,\n", + " data_train,\n", + " response_method=\"predict\",\n", + " cmap=\"tab10\",\n", + " norm=tab10_norm,\n", + " alpha=0.5,\n", ")\n", "ax = sns.scatterplot(\n", " data=penguins,\n", @@ -184,7 +203,102 @@ "We predict an Adelie penguin if the feature value is below the threshold,\n", "which is not surprising since this partition was almost pure. If the feature\n", "value is above the threshold, we predict the Gentoo penguin, the class that is\n", - "most probable." + "most probable.\n", + "\n", + "## (Estimated) predicted probabilities in multi-class problems\n", + "\n", + "For those interested, one can further try to visualize the output of\n", + "`predict_proba` for a multiclass problem using `DecisionBoundaryDisplay`,\n", + "except that for a K-class problem you have K probability outputs for each\n", + "data point. Visualizing all these on a single plot can quickly become tricky\n", + "to interpret. It is then common to instead produce K separate plots, one for\n", + "each class, in a one-vs-rest (or one-vs-all) fashion.\n", + "\n", + "For example, in the plot below, the first plot on the left shows in yellow the\n", + "certainty of classifying a data point as belonging to the \"Adelie\" class. In\n", + "the same plot, the spectrum from green to purple represents the certainty of\n", + "**not** belonging to the \"Adelie\" class. The same logic applies to the other\n", + "plots in the figure."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "solution" + ] + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "xx = np.linspace(30, 60, 100)\n", + "yy = np.linspace(10, 23, 100)\n", + "xx, yy = np.meshgrid(xx, yy)\n", + "Xfull = pd.DataFrame(\n", + " {\"Culmen Length (mm)\": xx.ravel(), \"Culmen Depth (mm)\": yy.ravel()}\n", + ")\n", + "\n", + "probas = tree.predict_proba(Xfull)\n", + "n_classes = len(np.unique(tree.classes_))\n", + "\n", + "_, axs = plt.subplots(ncols=3, nrows=1, sharey=True, figsize=(12, 5))\n", + "plt.suptitle(\"Predicted probabilities for decision tree model\", y=0.8)\n", + "\n", + "for class_of_interest in range(n_classes):\n", + " axs[class_of_interest].set_title(\n", + " f\"Class {tree.classes_[class_of_interest]}\"\n", + " )\n", + " imshow_handle = axs[class_of_interest].imshow(\n", + " probas[:, class_of_interest].reshape((100, 100)),\n", + " extent=(30, 60, 10, 23),\n", + " vmin=0.0,\n", + " vmax=1.0,\n", + " origin=\"lower\",\n", + " cmap=\"viridis\",\n", + " )\n", + " axs[class_of_interest].set_xlabel(\"Culmen Length (mm)\")\n", + " if class_of_interest == 0:\n", + " axs[class_of_interest].set_ylabel(\"Culmen Depth (mm)\")\n", + " idx = target_test == tree.classes_[class_of_interest]\n", + " axs[class_of_interest].scatter(\n", + " data_test[\"Culmen Length (mm)\"].loc[idx],\n", + " data_test[\"Culmen Depth (mm)\"].loc[idx],\n", + " marker=\"o\",\n", + " c=\"w\",\n", + " edgecolor=\"k\",\n", + " )\n", + "\n", + "ax = plt.axes([0.15, 0.04, 0.7, 0.05])\n", + "plt.colorbar(imshow_handle, cax=ax, orientation=\"horizontal\")\n", + "_ = plt.title(\"Probability\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "solution" + ] + }, + "source": [ + "
\n", + "<div class=\"admonition note alert alert-info\">\n", + "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n", + "<p class=\"last\">You may have noticed that we are no longer using a diverging colormap. Indeed,\n", + "the chance level for a one-vs-rest binarization of the multi-class\n", + "classification problem is almost never at a predicted probability of 0.5. So\n", + "using a colormap with a neutral white at 0.5 might give a false impression of\n", + "the certainty.</p>\n", + "</div>\n", + "\n", + "In future versions of scikit-learn `DecisionBoundaryDisplay` will support a\n", + "`class_of_interest` parameter that will in particular allow visualizing\n", + "`predict_proba` in multi-class settings.\n", + "\n", + "We also plan to make it possible to visualize the `predict_proba` values for\n", + "the class with the maximum predicted probability (without having to pass a\n", + "fixed `class_of_interest` value)." ] } ],
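Editor's aside (an illustration, not an official API): in the meantime, that last idea can be approximated by hand. The sketch below reuses `probas`, `xx` and `yy` from the solution cell above to display, at each grid point, the probability of the most likely class.

```python
# Visualize the predicted probability of the most likely class over the grid,
# reusing `probas`, `xx` and `yy` computed in the solution cell above.
import matplotlib.pyplot as plt

max_proba = probas.max(axis=1).reshape(xx.shape)

fig, ax = plt.subplots(figsize=(6, 5))
img = ax.imshow(
    max_proba,
    extent=(30, 60, 10, 23),
    vmin=1 / 3,  # with 3 classes, the largest probability is always >= 1/3
    vmax=1.0,
    origin="lower",
    cmap="viridis",
    aspect="auto",
)
ax.set_xlabel("Culmen Length (mm)")
ax.set_ylabel("Culmen Depth (mm)")
fig.colorbar(img, ax=ax, label="Probability of the predicted class")
_ = ax.set_title("Confidence of the most likely class")
```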