diff --git a/notebooks/linear_models_sol_03.ipynb b/notebooks/linear_models_sol_03.ipynb index 0eabeeb54..178514087 100644 --- a/notebooks/linear_models_sol_03.ipynb +++ b/notebooks/linear_models_sol_03.ipynb @@ -2,18 +2,25 @@ "cells": [ { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "lines_to_next_cell": 2 + }, "source": [ "# \ud83d\udcc3 Solution for Exercise M4.03\n", "\n", - "The parameter `penalty` can control the **type** of regularization to use,\n", - "whereas the regularization **strength** is set using the parameter `C`.\n", - "Setting`penalty=\"none\"` is equivalent to an infinitely large value of `C`. In\n", - "this exercise, we ask you to train a logistic regression classifier using the\n", - "`penalty=\"l2\"` regularization (which happens to be the default in\n", - "scikit-learn) to find by yourself the effect of the parameter `C`.\n", + "In the previous Module we tuned the hyperparameter `C` of the logistic\n", + "regression without mentioning that it controls the regularization strength.\n", + "Later, in the slides on \ud83c\udfa5 **Intuitions on regularized linear models** we\n", + "mentioned that a small `C` provides a more regularized model, whereas a\n", + "non-regularized model is obtained with an infinitely large value of `C`.\n", + "Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`\n", + "model.\n", + "\n", + "In this exercise, we ask you to train a logistic regression classifier using\n", + "different values of the parameter `C` to find its effects by yourself.\n", "\n", - "We start by loading the dataset." + "We start by loading the dataset. We only keep the Adelie and Chinstrap classes\n", + "to keep the discussion simple." ] }, { @@ -36,7 +43,6 @@ "import pandas as pd\n", "\n", "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n", - "# only keep the Adelie and Chinstrap classes\n", "penguins = (\n", " penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n", ")\n", @@ -53,7 +59,9 @@ "source": [ "from sklearn.model_selection import train_test_split\n", "\n", - "penguins_train, penguins_test = train_test_split(penguins, random_state=0)\n", + "penguins_train, penguins_test = train_test_split(\n", + " penguins, random_state=0, test_size=0.4\n", + ")\n", "\n", "data_train = penguins_train[culmen_columns]\n", "data_test = penguins_test[culmen_columns]\n", @@ -66,7 +74,67 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "First, let's create our predictive model." + "We define a function to help us fit a given `model` and plot its decision\n", + "boundary. We recall that by using a `DecisionBoundaryDisplay` with a diverging\n", + "colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped\n", + "to the white color. In other words, the darker the color, the closer the\n", + "predicted probability is to 0 or 1 and the more confident the classifier is in\n", + "its predictions."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.inspection import DecisionBoundaryDisplay\n", + "\n", + "\n", + "def plot_decision_boundary(model):\n", + " model.fit(data_train, target_train)\n", + " accuracy = model.score(data_test, target_test)\n", + "\n", + " disp = DecisionBoundaryDisplay.from_estimator(\n", + " model,\n", + " data_train,\n", + " response_method=\"predict_proba\",\n", + " plot_method=\"pcolormesh\",\n", + " cmap=\"RdBu_r\",\n", + " alpha=0.8,\n", + " vmin=0.0,\n", + " vmax=1.0,\n", + " )\n", + " DecisionBoundaryDisplay.from_estimator(\n", + " model,\n", + " data_train,\n", + " response_method=\"predict_proba\",\n", + " plot_method=\"contour\",\n", + " linestyles=\"--\",\n", + " linewidths=1,\n", + " alpha=0.8,\n", + " levels=[0.5],\n", + " ax=disp.ax_,\n", + " )\n", + " sns.scatterplot(\n", + " data=penguins_train,\n", + " x=culmen_columns[0],\n", + " y=culmen_columns[1],\n", + " hue=target_column,\n", + " palette=[\"tab:blue\", \"tab:red\"],\n", + " ax=disp.ax_,\n", + " )\n", + " plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n", + " plt.title(f\"C: {C} \\n Accuracy on the test set: {accuracy:.2f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's now create our predictive model." ] }, { @@ -79,19 +147,24 @@ "from sklearn.preprocessing import StandardScaler\n", "from sklearn.linear_model import LogisticRegression\n", "\n", - "logistic_regression = make_pipeline(\n", - " StandardScaler(), LogisticRegression(penalty=\"l2\")\n", - ")" + "logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Given the following candidates for the `C` parameter, find out the impact of\n", - "`C` on the classifier decision boundary. You can use\n", - "`sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the\n", - "decision function boundary." + "## Influence of the parameter `C` on the decision boundary\n", + "\n", + "Given the following candidates for the `C` parameter and the\n", + "`plot_decision_boundary` function, find out the impact of `C` on the\n", + "classifier's decision boundary.\n", + "\n", + "- How does the value of `C` impact the confidence on the predictions?\n", + "- How does it impact the underfit/overfit trade-off?\n", + "- How does it impact the position and orientation of the decision boundary?\n", + "\n", + "Try to give an interpretation on the reason for such behavior." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "Cs = [0.01, 0.1, 1, 10]\n", + "Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]\n", "\n", "# solution\n", - "import matplotlib.pyplot as plt\n", - "import seaborn as sns\n", - "from sklearn.inspection import DecisionBoundaryDisplay\n", - "\n", "for C in Cs:\n", " logistic_regression.set_params(logisticregression__C=C)\n", - " logistic_regression.fit(data_train, target_train)\n", - " accuracy = logistic_regression.score(data_test, target_test)\n", + " plot_decision_boundary(logistic_regression)" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "\n", - " DecisionBoundaryDisplay.from_estimator(\n", - " logistic_regression,\n", - " data_test,\n", - " response_method=\"predict\",\n", - " cmap=\"RdBu_r\",\n", - " alpha=0.5,\n", - " )\n", - " sns.scatterplot(\n", - " data=penguins_test,\n", - " x=culmen_columns[0],\n", - " y=culmen_columns[1],\n", - " hue=target_column,\n", - " palette=[\"tab:red\", \"tab:blue\"],\n", - " )\n", - " plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n", - " plt.title(f\"C: {C} \\n Accuracy on the test set: {accuracy:.2f}\")" + "On this series of plots we can observe several important points. Regarding the\n", + "confidence on the predictions:\n", + "\n", + "- For low values of `C` (strong regularization), the classifier is less\n", + " confident in its predictions. We are enforcing a **spread sigmoid**.\n", + "- For high values of `C` (weak regularization), the classifier is more\n", + " confident: the areas with dark blue (very confident in predicting \"Adelie\")\n", + " and dark red (very confident in predicting \"Chinstrap\") nearly cover the\n", + " entire feature space. We are enforcing a **steep sigmoid**.\n", + "\n", + "To answer the next question, keep in mind that misclassified data points are\n", + "more costly when the classifier is more confident in its decision. Decision\n", + "rules are mostly driven by avoiding such cost. From the previous observations\n", + "we can then deduce that:\n", + "\n", + "- The smaller the `C` (the stronger the regularization), the lower the cost\n", + " of a misclassification. As more data points lie in the low-confidence\n", + " zone, the more the decision rules are influenced almost uniformly by all\n", + " the data points. This leads to a less expressive model, which may underfit.\n", + "- The higher the value of `C` (the weaker the regularization), the more the\n", + " decision is influenced by a few training points very close to the boundary,\n", + " where decisions are costly. Remember that models may overfit if the number\n", + " of samples in the training set is too small, as a minimum number of\n", + " samples is needed to average the noise out.\n", + "\n", + "The orientation is the result of two factors: minimizing the number of\n", + "misclassified training points with high confidence and their distance to the\n", + "decision boundary (notice how the contour line tries to align with the most\n", + "misclassified data points in the dark-colored zone). 
This is closely related\n", + "to the value of the weights of the model, which is explained in the next part\n", + "of the exercise.\n", + "\n", + "Finally, for small values of `C` the position of the decision boundary is\n", + "affected by the class imbalance: when `C` is near zero, the model predicts the\n", + "majority class (as seen in the training set) everywhere in the feature space.\n", + "In our case, there are approximately two times more \"Adelie\" than \"Chinstrap\"\n", + "penguins. This explains why the decision boundary is shifted to the right when\n", + "`C` gets smaller. Indeed, the most regularized model predicts light blue\n", + "almost everywhere in the feature space." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Look at the impact of the `C` hyperparameter on the magnitude of the weights." + "## Impact of the regularization on the weights\n", + "\n", + "Look at the impact of the `C` hyperparameter on the magnitude of the weights.\n", + "**Hint**: You can [access pipeline\n", + "steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)\n", + "by name or position. Then you can query the attributes of that step such as\n", + "`coef_`." ] }, { @@ -144,12 +251,12 @@ "outputs": [], "source": [ "# solution\n", - "weights_ridge = []\n", + "lr_weights = []\n", "for C in Cs:\n", " logistic_regression.set_params(logisticregression__C=C)\n", " logistic_regression.fit(data_train, target_train)\n", " coefs = logistic_regression[-1].coef_[0]\n", - " weights_ridge.append(pd.Series(coefs, index=culmen_columns))" + " lr_weights.append(pd.Series(coefs, index=culmen_columns))" ] }, { @@ -162,8 +269,8 @@ }, "outputs": [], "source": [ - "weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f\"C: {C}\" for C in Cs])\n", - "weights_ridge.plot.barh()\n", + "lr_weights = pd.concat(lr_weights, axis=1, keys=[f\"C: {C}\" for C in Cs])\n", + "lr_weights.plot.barh()\n", "_ = plt.title(\"LogisticRegression weights depending of C\")" ] }, @@ -175,14 +282,101 @@ ] }, "source": [ - "We see that a small `C` will shrink the weights values toward zero. It means\n", - "that a small `C` provides a more regularized model. Thus, `C` is the inverse\n", - "of the `alpha` coefficient in the `Ridge` model.\n", "\n", - "Besides, with a strong penalty (i.e. small `C` value), the weight of the\n", - "feature \"Culmen Depth (mm)\" is almost zero. It explains why the decision\n", + "As a small `C` provides a more regularized model, it shrinks the weight values\n", + "toward zero, as in the `Ridge` model.\n", + "\n", + "In particular, with a strong penalty (e.g. `C = 0.01`), the weight of the feature\n", + "named \"Culmen Depth (mm)\" is almost zero. It explains why the decision\n", "separation in the plot is almost perpendicular to the \"Culmen Length (mm)\"\n", "feature.\n", + "\n", + "For even stronger penalty strengths (e.g. `C = 1e-6`), the weights of both\n", + "features are almost zero. It explains why the decision separation in the plot\n", + "is almost constant in the feature space: the predicted probability is only\n", + "based on the intercept parameter of the model (which is never regularized)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "## Impact of the regularization with non-linear feature engineering\n", + "\n", + "Use the `plot_decision_boundary` function to repeat the experiment using a\n", + "non-linear feature engineering pipeline. 
For such purpose, insert\n", + "`Nystroem(kernel=\"rbf\", gamma=1, n_components=100)` between the\n", + "`StandardScaler` and the `LogisticRegression` steps.\n", + "\n", + "- Does the value of `C` still impact the position of the decision boundary and\n", + " the confidence of the model?\n", + "- What can you say about the impact of `C` on the underfitting vs overfitting\n", + " trade-off?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "from sklearn.kernel_approximation import Nystroem\n", + "\n", + "# solution\n", + "classifier = make_pipeline(\n", + " StandardScaler(),\n", + " Nystroem(kernel=\"rbf\", gamma=1.0, n_components=100, random_state=0),\n", + " LogisticRegression(penalty=\"l2\", max_iter=1000),\n", + ")\n", + "\n", + "for C in Cs:\n", + " classifier.set_params(logisticregression__C=C)\n", + " plot_decision_boundary(classifier)" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "\n", + "- For the lowest values of `C`, the overall pipeline underfits: it predicts\n", + " the majority class everywhere, as previously.\n", + "- When `C` increases, the model starts to predict some data points from the\n", + " \"Chinstrap\" class but the model is not very confident anywhere in the\n", + " feature space.\n", + "- The decision boundary is no longer a straight line: the linear model is now\n", + " classifying in the 100-dimensional feature space created by the `Nystroem`\n", + " transformer. As a result, the decision boundary induced by the overall\n", + " pipeline is now expressive enough to wrap around the minority class.\n", + "- For `C = 1` in particular, it finds a smooth red blob around most of the\n", + " \"Chinstrap\" data points. When moving away from the data points, the model is\n", + " less confident in its predictions and again tends to predict the majority\n", + " class according to the proportion in the training set.\n", + "- For higher values of `C`, the model starts to overfit: it is very confident\n", + " in its predictions almost everywhere, but it should not be trusted: the\n", + " model also makes a larger number of mistakes on the test set (not shown in\n", + " the plot) while adopting a very curvy decision boundary to attempt fitting\n", + " all the training points, including the noisy ones at the frontier between\n", + " the two classes. This makes the decision boundary very sensitive to the\n", + " sampling of the training set and as a result, it does not generalize well in\n", + " that region. This is confirmed by the (slightly) lower accuracy on the test\n", + " set.\n", + "\n", + "Finally, we can also note that the linear model on the raw features was as\n", + "good as or better than the best model using non-linear feature engineering. So\n", + "in this case, we did not really need this extra complexity in our pipeline.\n", + "**Simpler is better!**\n", + "\n", + "So to conclude, when using non-linear feature engineering, it is often\n", + "possible to make the pipeline overfit, even if the original feature space is\n", + "low-dimensional. As a result, it is important to tune the regularization\n", + "parameter in conjunction with the parameters of the transformers (e.g. tuning\n", + "`gamma` would be important here). This has a direct impact on the certainty of\n", + "the predictions." 
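To make that last remark concrete, here is a minimal sketch (not part of the original notebook) of how `gamma` and `C` could be tuned jointly with a grid search on the `classifier` pipeline defined above. The step names `nystroem` and `logisticregression` are the ones generated automatically by `make_pipeline`, and the grid values below are illustrative assumptions, not the course's choices:

```python
# Hedged sketch: jointly tune the Nystroem gamma and the LogisticRegression C.
# Assumes the `classifier` pipeline and the data/target splits defined earlier.
from sklearn.model_selection import GridSearchCV

param_grid = {
    "nystroem__gamma": [0.01, 0.1, 1.0, 10.0],  # example values only
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0, 100.0],
}
search = GridSearchCV(classifier, param_grid=param_grid, cv=5)
search.fit(data_train, target_train)
print(search.best_params_)
print(f"Test accuracy: {search.score(data_test, target_test):.2f}")
```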
] } ], diff --git a/notebooks/parameter_tuning_ex_02.ipynb b/notebooks/parameter_tuning_ex_02.ipynb index 46345e86b..2aa096d5c 100644 --- a/notebooks/parameter_tuning_ex_02.ipynb +++ b/notebooks/parameter_tuning_ex_02.ipynb @@ -57,7 +57,6 @@ " )\n", " ],\n", " remainder=\"passthrough\",\n", - " sparse_threshold=0,\n", ")\n", "\n", "from sklearn.ensemble import HistGradientBoostingClassifier\n", diff --git a/notebooks/parameter_tuning_grid_search.ipynb b/notebooks/parameter_tuning_grid_search.ipynb index d26aff083..e0912cb54 100644 --- a/notebooks/parameter_tuning_grid_search.ipynb +++ b/notebooks/parameter_tuning_grid_search.ipynb @@ -157,7 +157,6 @@ "preprocessor = ColumnTransformer(\n", " [(\"cat_preprocessor\", categorical_preprocessor, categorical_columns)],\n", " remainder=\"passthrough\",\n", - " sparse_threshold=0,\n", ")" ] }, diff --git a/notebooks/parameter_tuning_nested.ipynb b/notebooks/parameter_tuning_nested.ipynb index b7c14a3bf..efc43173d 100644 --- a/notebooks/parameter_tuning_nested.ipynb +++ b/notebooks/parameter_tuning_nested.ipynb @@ -70,7 +70,6 @@ " (\"cat_preprocessor\", categorical_preprocessor, categorical_columns),\n", " ],\n", " remainder=\"passthrough\",\n", - " sparse_threshold=0,\n", ")" ] }, diff --git a/notebooks/parameter_tuning_randomized_search.ipynb b/notebooks/parameter_tuning_randomized_search.ipynb index 11bfac389..3189e9301 100644 --- a/notebooks/parameter_tuning_randomized_search.ipynb +++ b/notebooks/parameter_tuning_randomized_search.ipynb @@ -121,7 +121,6 @@ "preprocessor = ColumnTransformer(\n", " [(\"cat_preprocessor\", categorical_preprocessor, categorical_columns)],\n", " remainder=\"passthrough\",\n", - " sparse_threshold=0,\n", ")" ] }, diff --git a/notebooks/parameter_tuning_sol_02.ipynb b/notebooks/parameter_tuning_sol_02.ipynb index bbcb42f88..58ef6a501 100644 --- a/notebooks/parameter_tuning_sol_02.ipynb +++ b/notebooks/parameter_tuning_sol_02.ipynb @@ -57,7 +57,6 @@ " )\n", " ],\n", " remainder=\"passthrough\",\n", - " sparse_threshold=0,\n", ")\n", "\n", "from sklearn.ensemble import HistGradientBoostingClassifier\n", diff --git a/notebooks/trees_classification.ipynb b/notebooks/trees_classification.ipynb index dfcae831c..22eae1fca 100644 --- a/notebooks/trees_classification.ipynb +++ b/notebooks/trees_classification.ipynb @@ -6,8 +6,11 @@ "source": [ "# Build a classification decision tree\n", "\n", - "We will illustrate how decision tree fit data with a simple classification\n", - "problem using the penguins dataset." + "In this notebook we illustrate decision trees in a multiclass classification\n", + "problem by using the penguins dataset with 2 features and 3 classes.\n", + "\n", + "For the sake of simplicity, we focus the discussion on the hyperparameter\n", + "`max_depth`, which controls the maximal depth of the decision tree." ] }, { @@ -38,8 +41,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Besides, we split the data into two subsets to investigate how trees will\n", - "predict values based on an out-of-samples dataset." + "First, we split the data into two subsets to investigate how trees predict\n", + "values based on unseen data." ] }, { @@ -60,16 +63,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In a previous notebook, we learnt that a linear classifier will define a\n", - "linear separation to split classes using a linear combination of the input\n", - "features. 
In our 2-dimensional space, it means that a linear classifier will\n", - "define some oblique lines that best separate our classes. We define a function\n", - "below that, given a set of data points and a classifier, will plot the\n", - "decision boundaries learnt by the classifier.\n", - "\n", - "Thus, for a linear classifier, we will obtain the following decision\n", - "boundaries. These boundaries lines indicate where the model changes its\n", - "prediction from one class to another." + "In a previous notebook, we learnt that linear classifiers define a linear\n", + "separation to split classes using a linear combination of the input features.\n", + "In our 2-dimensional feature space, it means that a linear classifier finds\n", + "the oblique lines that best separate the classes. This is still true for\n", + "multiclass problems, except that more than one line is fitted. We can use\n", + "`DecisionBoundaryDisplay` to plot the decision boundaries learnt by the\n", + "classifier." ] }, { @@ -91,15 +91,22 @@ "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", + "import matplotlib as mpl\n", "import seaborn as sns\n", "\n", "from sklearn.inspection import DecisionBoundaryDisplay\n", "\n", + "tab10_norm = mpl.colors.Normalize(vmin=-0.5, vmax=8.5)\n", "# create a palette to be used in the scatterplot\n", - "palette = [\"tab:red\", \"tab:blue\", \"black\"]\n", + "palette = [\"tab:blue\", \"tab:green\", \"tab:orange\"]\n", "\n", - "DecisionBoundaryDisplay.from_estimator(\n", - " linear_model, data_train, response_method=\"predict\", cmap=\"RdBu\", alpha=0.5\n", + "dbd = DecisionBoundaryDisplay.from_estimator(\n", + " linear_model,\n", + " data_train,\n", + " response_method=\"predict\",\n", + " cmap=\"tab10\",\n", + " norm=tab10_norm,\n", + " alpha=0.5,\n", ")\n", "sns.scatterplot(\n", " data=penguins,\n", @@ -119,7 +126,7 @@ "source": [ "We see that the lines are a combination of the input features since they are\n", "not perpendicular a specific axis. Indeed, this is due to the model\n", - "parametrization that we saw in the previous notebook, controlled by the\n", + "parametrization that we saw in some previous notebooks, i.e. controlled by the\n", "model's weights and intercept.\n", "\n", "Besides, it seems that the linear model would be a good candidate for such\n", @@ -141,13 +148,27 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Unlike linear models, decision trees are non-parametric models: they are not\n", - "controlled by a mathematical decision function and do not have weights or\n", - "intercept to be optimized.\n", + "Unlike linear models, the decision rule for the decision tree is not\n", + "controlled by a simple linear combination of weights and feature values.\n", + "\n", + "Instead, the decision rules of trees can be defined in terms of\n", + "- the feature index used at each split node of the tree,\n", + "- the threshold value used at each split node,\n", + "- the value to predict at each leaf node.\n", + "\n", + "Decision trees partition the feature space by considering a single feature at\n", + "a time. The number of splits depends on both the hyperparameters and the\n", + "number of data points in the training set: the more flexible the\n", + "hyperparameters and the larger the training set, the more splits can be\n", + "considered by the model.\n", "\n", - "Indeed, decision trees will partition the space by considering a single\n", - "feature at a time. 
Let's illustrate this behaviour by having a decision tree\n", - "make a single split to partition the feature space." + "As the number of adjustable components taking part in the decision rule\n", + "changes with the training size, we say that decision trees are non-parametric\n", + "models.\n", + "\n", + "Let's now visualize the shape of the decision boundary of a decision tree when\n", + "we set the `max_depth` hyperparameter to only allow for a single split to\n", + "partition the feature space." ] }, { @@ -169,7 +190,12 @@ "outputs": [], "source": [ "DecisionBoundaryDisplay.from_estimator(\n", - " tree, data_train, response_method=\"predict\", cmap=\"RdBu\", alpha=0.5\n", + " tree,\n", + " data_train,\n", + " response_method=\"predict\",\n", + " cmap=\"tab10\",\n", + " norm=tab10_norm,\n", + " alpha=0.5,\n", ")\n", "sns.scatterplot(\n", " data=penguins,\n", @@ -188,8 +214,8 @@ "source": [ "The partitions found by the algorithm separates the data along the axis\n", "\"Culmen Depth\", discarding the feature \"Culmen Length\". Thus, it highlights\n", - "that a decision tree does not use a combination of feature when making a\n", - "split. We can look more in depth at the tree structure." + "that a decision tree does not use a combination of features when making a\n", + "single split. We can look more in depth at the tree structure." ] }, { @@ -230,16 +256,16 @@ "dataset was subdivided into 2 sets based on the culmen depth (inferior or\n", "superior to 16.45 mm).\n", "\n", - "This partition of the dataset minimizes the class diversities in each\n", + "This partition of the dataset minimizes the class diversity in each\n", "sub-partitions. This measure is also known as a **criterion**, and is a\n", "settable parameter.\n", "\n", "If we look more closely at the partition, we see that the sample superior to\n", - "16.45 belongs mainly to the Adelie class. Looking at the values, we indeed\n", - "observe 103 Adelie individuals in this space. We also count 52 Chinstrap\n", - "samples and 6 Gentoo samples. We can make similar interpretation for the\n", + "16.45 belongs mainly to the \"Adelie\" class. Looking at the values, we indeed\n", + "observe 103 \"Adelie\" individuals in this space. We also count 52 \"Chinstrap\"\n", + "samples and 6 \"Gentoo\" samples. We can make similar interpretation for the\n", "partition defined by a threshold inferior to 16.45mm. In this case, the most\n", - "represented class is the Gentoo species.\n", + "represented class is the \"Gentoo\" species.\n", "\n", "Let's see how our tree would work as a predictor. Let's start with a case\n", "where the culmen depth is inferior to the threshold." @@ -251,15 +277,17 @@ "metadata": {}, "outputs": [], "source": [ - "sample_1 = pd.DataFrame({\"Culmen Length (mm)\": [0], \"Culmen Depth (mm)\": [15]})\n", - "tree.predict(sample_1)" + "test_penguin_1 = pd.DataFrame(\n", + " {\"Culmen Length (mm)\": [0], \"Culmen Depth (mm)\": [15]}\n", + ")\n", + "tree.predict(test_penguin_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The class predicted is the Gentoo. We can now check what happens if we pass a\n", + "The class predicted is the \"Gentoo\". We can now check what happens if we pass a\n", "culmen depth superior to the threshold." 
] }, { @@ -269,17 +297,19 @@ "metadata": {}, "outputs": [], "source": [ - "sample_2 = pd.DataFrame({\"Culmen Length (mm)\": [0], \"Culmen Depth (mm)\": [17]})\n", - "tree.predict(sample_2)" + "test_penguin_2 = pd.DataFrame(\n", + " {\"Culmen Length (mm)\": [0], \"Culmen Depth (mm)\": [17]}\n", + ")\n", + "tree.predict(test_penguin_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "In this case, the tree predicts the Adelie specie.\n", + "In this case, the tree predicts the \"Adelie\" species.\n", "\n", - "Thus, we can conclude that a decision tree classifier will predict the most\n", + "Thus, we can conclude that a decision tree classifier predicts the most\n", "represented class within a partition.\n", "\n", "During the training, we have a count of samples in each partition, we can also\n", @@ -293,7 +323,7 @@ "metadata": {}, "outputs": [], "source": [ - "y_pred_proba = tree.predict_proba(sample_2)\n", + "y_pred_proba = tree.predict_proba(test_penguin_2)\n", "y_proba_class_0 = pd.Series(y_pred_proba[0], index=tree.classes_)" ] }, @@ -338,8 +368,8 @@ "metadata": {}, "source": [ "It is also important to note that the culmen length has been disregarded for\n", - "the moment. It means that whatever the value given, it will not be used during\n", - "the prediction." + "the moment. It means that regardless of its value, it is not used during the\n", + "prediction." ] }, { @@ -348,10 +378,10 @@ "metadata": {}, "outputs": [], "source": [ - "sample_3 = pd.DataFrame(\n", " {\"Culmen Length (mm)\": [10_000], \"Culmen Depth (mm)\": [17]}\n", ")\n", - "tree.predict_proba(sample_3)" + "test_penguin_3 = pd.DataFrame(\n", " {\"Culmen Length (mm)\": [10_000], \"Culmen Depth (mm)\": [17]}\n", ")\n", + "tree.predict_proba(test_penguin_3)" ] }, { @@ -378,12 +408,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Indeed, it is not a surprise. We saw earlier that a single feature will not be\n", - "able to separate all three species. However, from the previous analysis we saw\n", - "that by using both features we should be able to get fairly good results.\n", + "Indeed, it is not a surprise. We saw earlier that a single feature is not able\n", + "to separate all three species: it underfits. However, from the previous\n", + "analysis we saw that by using both features we should be able to get fairly\n", + "good results.\n", "\n", - "In the next exercise, you will increase the size of the tree depth. You will\n", - "get intuitions on how the space partitioning is repeated over time." + "In the next exercise, you will increase the tree depth to get an intuition on\n", + "how such a parameter affects the space partitioning." ] } ], diff --git a/notebooks/trees_sol_01.ipynb b/notebooks/trees_sol_01.ipynb index c126f23fa..2ce0c1b8b 100644 --- a/notebooks/trees_sol_01.ipynb +++ b/notebooks/trees_sol_01.ipynb @@ -6,16 +6,13 @@ "source": [ "# \ud83d\udcc3 Solution for Exercise M5.01\n", "\n", - "In the previous notebook, we showed how a tree with a depth of 1 level was\n", - "working. The aim of this exercise is to repeat part of the previous experiment\n", - "for a depth with 2 levels to show how the process of partitioning is repeated\n", - "over time.\n", + "In the previous notebook, we showed how a tree with a depth of 1 level works. 
The\n", + "aim of this exercise is to repeat part of the previous experiment for a tree\n", + "with 2 levels depth to show how such parameter affects the feature space\n", + "partitioning.\n", "\n", - "Before to start, we will:\n", - "\n", - "* load the dataset;\n", - "* split the dataset into training and testing dataset;\n", - "* define the function to show the classification decision function." + "We first load the penguins dataset and split it into a training and a testing\n", + "sets:" ] }, { @@ -61,10 +58,7 @@ "metadata": {}, "source": [ "Create a decision tree classifier with a maximum depth of 2 levels and fit the\n", - "training data. Once this classifier trained, plot the data and the decision\n", - "boundary to see the benefit of increasing the depth. To plot the decision\n", - "boundary, you should import the class `DecisionBoundaryDisplay` from the\n", - "module `sklearn.inspection` as shown in the previous course notebook." + "training data." ] }, { @@ -80,24 +74,49 @@ "tree.fit(data_train, target_train)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now plot the data and the decision boundary of the trained classifier to see\n", + "the effect of increasing the depth of the tree.\n", + "\n", + "Hint: Use the class `DecisionBoundaryDisplay` from the module\n", + "`sklearn.inspection` as shown in previous course notebooks.\n", + "\n", + "
**Warning**: At this time, it is not possible to use\n", + "`response_method=\"predict_proba\"` for multiclass problems. This is a planned\n", + "feature for a future version of scikit-learn. In the meantime, you can use\n", + "`response_method=\"predict\"` instead.\n", + "\n", + "**Note**: You may have noticed that we are no longer using a diverging\n", + "colormap. Indeed, the chance level for a one-vs-rest binarization of the\n", + "multi-class classification problem is almost never at a predicted probability\n", + "of 0.5. So using a colormap with a neutral white at 0.5 might give a false\n", + "impression on the certainty."