Update notebooks #740

Merged
merged 1 commit on Oct 26, 2023
298 changes: 246 additions & 52 deletions notebooks/linear_models_sol_03.ipynb

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion notebooks/parameter_tuning_ex_02.ipynb
@@ -57,7 +57,6 @@
" )\n",
" ],\n",
" remainder=\"passthrough\",\n",
" sparse_threshold=0,\n",
")\n",
"\n",
"from sklearn.ensemble import HistGradientBoostingClassifier\n",
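These hunks drop `sparse_threshold=0` from several `ColumnTransformer` definitions. For context, in scikit-learn that argument controls whether the stacked output is returned dense or sparse: `sparse_threshold=0` always returns a dense array, while the default (`0.3`) may return a SciPy sparse matrix when the overall output density is low. The sketch below only illustrates that behaviour; the toy data and the `OneHotEncoder` are assumptions, not taken from these notebooks.

```python
# Illustration of ColumnTransformer's sparse_threshold; the toy data and
# OneHotEncoder below are assumptions, not taken from these notebooks.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"color": ["a", "b", "c", "d", "e"]})

# Default sparse_threshold=0.3: the one-hot output has density 5/25 = 0.2,
# below the threshold, so a SciPy sparse matrix is returned.
sparse_preprocessor = ColumnTransformer(
    [("cat", OneHotEncoder(), ["color"])],
    remainder="passthrough",
)
print(type(sparse_preprocessor.fit_transform(X)))

# sparse_threshold=0 (the argument removed here) always forces a dense ndarray.
dense_preprocessor = ColumnTransformer(
    [("cat", OneHotEncoder(), ["color"])],
    remainder="passthrough",
    sparse_threshold=0,
)
print(type(dense_preprocessor.fit_transform(X)))
```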
1 change: 0 additions & 1 deletion notebooks/parameter_tuning_grid_search.ipynb
@@ -157,7 +157,6 @@
"preprocessor = ColumnTransformer(\n",
" [(\"cat_preprocessor\", categorical_preprocessor, categorical_columns)],\n",
" remainder=\"passthrough\",\n",
" sparse_threshold=0,\n",
")"
]
},
1 change: 0 additions & 1 deletion notebooks/parameter_tuning_nested.ipynb
@@ -70,7 +70,6 @@
" (\"cat_preprocessor\", categorical_preprocessor, categorical_columns),\n",
" ],\n",
" remainder=\"passthrough\",\n",
" sparse_threshold=0,\n",
")"
]
},
1 change: 0 additions & 1 deletion notebooks/parameter_tuning_randomized_search.ipynb
@@ -121,7 +121,6 @@
"preprocessor = ColumnTransformer(\n",
" [(\"cat_preprocessor\", categorical_preprocessor, categorical_columns)],\n",
" remainder=\"passthrough\",\n",
" sparse_threshold=0,\n",
")"
]
},
1 change: 0 additions & 1 deletion notebooks/parameter_tuning_sol_02.ipynb
@@ -57,7 +57,6 @@
" )\n",
" ],\n",
" remainder=\"passthrough\",\n",
" sparse_threshold=0,\n",
")\n",
"\n",
"from sklearn.ensemble import HistGradientBoostingClassifier\n",
129 changes: 80 additions & 49 deletions notebooks/trees_classification.ipynb
@@ -6,8 +6,11 @@
"source": [
"# Build a classification decision tree\n",
"\n",
"We will illustrate how decision tree fit data with a simple classification\n",
"problem using the penguins dataset."
"In this notebook we illustrate decision trees in a multiclass classification\n",
"problem by using the penguins dataset with 2 features and 3 classes.\n",
"\n",
"For the sake of simplicity, we focus the discussion on the hyperparamter\n",
"`max_depth`, which controls the maximal depth of the decision tree."
]
},
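The collapsed cells that follow presumably load the data and select the two culmen features; for readers skimming the diff, here is a minimal setup sketch. The CSV path and the `Species` target column name are assumptions not shown in this diff; the two feature names appear later in the notebook.

```python
# Minimal setup sketch; the file path and the "Species" target column
# are assumptions, not shown in this diff.
import pandas as pd

penguins = pd.read_csv("../datasets/penguins_classification.csv")

culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"

data, target = penguins[culmen_columns], penguins[target_column]
```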
{
@@ -38,8 +41,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides, we split the data into two subsets to investigate how trees will\n",
"predict values based on an out-of-samples dataset."
"First, we split the data into two subsets to investigate how trees predict\n",
"values based on unseen data."
]
},
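The collapsed code right after this cell presumably performs that split; a minimal sketch of what it could look like. The `random_state` value and the names of the test-fold variables are assumptions.

```python
# Hedged sketch of the train/test split; random_state=0 is an assumption.
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0
)
```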
{
@@ -60,16 +63,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In a previous notebook, we learnt that a linear classifier will define a\n",
"linear separation to split classes using a linear combination of the input\n",
"features. In our 2-dimensional space, it means that a linear classifier will\n",
"define some oblique lines that best separate our classes. We define a function\n",
"below that, given a set of data points and a classifier, will plot the\n",
"decision boundaries learnt by the classifier.\n",
"\n",
"Thus, for a linear classifier, we will obtain the following decision\n",
"boundaries. These boundaries lines indicate where the model changes its\n",
"prediction from one class to another."
"In a previous notebook, we learnt that linear classifiers define a linear\n",
"separation to split classes using a linear combination of the input features.\n",
"In our 2-dimensional feature space, it means that a linear classifier finds\n",
"the oblique lines that best separate the classes. This is still true for\n",
"multiclass problems, except that more than one line is fitted. We can use\n",
"`DecisionBoundaryDisplay` to plot the decision boundaries learnt by the\n",
"classifier."
]
},
{
@@ -91,15 +91,22 @@
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import matplotlib as mpl\n",
"import seaborn as sns\n",
"\n",
"from sklearn.inspection import DecisionBoundaryDisplay\n",
"\n",
"tab10_norm = mpl.colors.Normalize(vmin=-0.5, vmax=8.5)\n",
"# create a palette to be used in the scatterplot\n",
"palette = [\"tab:red\", \"tab:blue\", \"black\"]\n",
"palette = [\"tab:blue\", \"tab:green\", \"tab:orange\"]\n",
"\n",
"DecisionBoundaryDisplay.from_estimator(\n",
" linear_model, data_train, response_method=\"predict\", cmap=\"RdBu\", alpha=0.5\n",
"dbd = DecisionBoundaryDisplay.from_estimator(\n",
" linear_model,\n",
" data_train,\n",
" response_method=\"predict\",\n",
" cmap=\"tab10\",\n",
" norm=tab10_norm,\n",
" alpha=0.5,\n",
")\n",
"sns.scatterplot(\n",
" data=penguins,\n",
@@ -119,7 +126,7 @@
"source": [
"We see that the lines are a combination of the input features since they are\n",
"not perpendicular a specific axis. Indeed, this is due to the model\n",
"parametrization that we saw in the previous notebook, controlled by the\n",
"parametrization that we saw in some previous notebooks, i.e. controlled by the\n",
"model's weights and intercept.\n",
"\n",
"Besides, it seems that the linear model would be a good candidate for such\n",
@@ -141,13 +148,27 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Unlike linear models, decision trees are non-parametric models: they are not\n",
"controlled by a mathematical decision function and do not have weights or\n",
"intercept to be optimized.\n",
"Unlike linear models, the decision rule for the decision tree is not\n",
"controlled by a simple linear combination of weights and feature values.\n",
"\n",
"Instead, the decision rules of trees can be defined in terms of\n",
"- the feature index used at each split node of the tree,\n",
"- the threshold value used at each split node,\n",
"- the value to predict at each leaf node.\n",
"\n",
"Decision trees partition the feature space by considering a single feature at\n",
"a time. The number of splits depends on both the hyperparameters and the\n",
"number of data points in the training set: the more flexible the\n",
"hyperparameters and the larger the training set, the more splits can be\n",
"considered by the model.\n",
"\n",
"Indeed, decision trees will partition the space by considering a single\n",
"feature at a time. Let's illustrate this behaviour by having a decision tree\n",
"make a single split to partition the feature space."
"As the number of adjustable components taking part in the decision rule\n",
"changes with the training size, we say that decision trees are non-parametric\n",
"models.\n",
"\n",
"Let's now visualize the shape of the decision boundary of a decision tree when\n",
"we set the `max_depth` hyperparameter to only allow for a single split to\n",
"partition the feature space."
]
},
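These three ingredients can be read directly from a fitted tree through its `tree_` attribute. Below is a minimal sketch, assuming a `DecisionTreeClassifier` restricted to a single split as described above and the `data_train`/`target_train` names from the split sketch earlier; the exact semantics of `value` (counts versus fractions) depend on the scikit-learn version.

```python
# Sketch: inspect the decision rule of a depth-1 tree.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=1)
tree.fit(data_train, target_train)

structure = tree.tree_
print(structure.feature)    # feature index used at each node (-2 marks a leaf)
print(structure.threshold)  # threshold used at each split node
print(structure.value)      # per-class statistics stored at each node
```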
{
@@ -169,7 +190,12 @@
"outputs": [],
"source": [
"DecisionBoundaryDisplay.from_estimator(\n",
" tree, data_train, response_method=\"predict\", cmap=\"RdBu\", alpha=0.5\n",
" tree,\n",
" data_train,\n",
" response_method=\"predict\",\n",
" cmap=\"tab10\",\n",
" norm=tab10_norm,\n",
" alpha=0.5,\n",
")\n",
"sns.scatterplot(\n",
" data=penguins,\n",
@@ -188,8 +214,8 @@
"source": [
"The partitions found by the algorithm separates the data along the axis\n",
"\"Culmen Depth\", discarding the feature \"Culmen Length\". Thus, it highlights\n",
"that a decision tree does not use a combination of feature when making a\n",
"split. We can look more in depth at the tree structure."
"that a decision tree does not use a combination of features when making a\n",
"single split. We can look more in depth at the tree structure."
]
},
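The collapsed cell that follows presumably renders this structure; a minimal sketch using scikit-learn's `plot_tree`, reusing the fitted `tree` from the sketch above (the figure size is arbitrary).

```python
# Sketch: visualize the fitted depth-1 tree.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

_, ax = plt.subplots(figsize=(8, 6))
plot_tree(
    tree,
    feature_names=["Culmen Length (mm)", "Culmen Depth (mm)"],
    class_names=tree.classes_,
    impurity=False,
    ax=ax,
)
```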
{
@@ -230,16 +256,16 @@
"dataset was subdivided into 2 sets based on the culmen depth (inferior or\n",
"superior to 16.45 mm).\n",
"\n",
"This partition of the dataset minimizes the class diversities in each\n",
"This partition of the dataset minimizes the class diversity in each\n",
"sub-partitions. This measure is also known as a **criterion**, and is a\n",
"settable parameter.\n",
"\n",
"If we look more closely at the partition, we see that the sample superior to\n",
"16.45 belongs mainly to the Adelie class. Looking at the values, we indeed\n",
"observe 103 Adelie individuals in this space. We also count 52 Chinstrap\n",
"samples and 6 Gentoo samples. We can make similar interpretation for the\n",
"16.45 belongs mainly to the \"Adelie\" class. Looking at the values, we indeed\n",
"observe 103 \"Adelie\" individuals in this space. We also count 52 \"Chinstrap\"\n",
"samples and 6 \"Gentoo\" samples. We can make similar interpretation for the\n",
"partition defined by a threshold inferior to 16.45mm. In this case, the most\n",
"represented class is the Gentoo species.\n",
"represented class is the \"Gentoo\" species.\n",
"\n",
"Let's see how our tree would work as a predictor. Let's start with a case\n",
"where the culmen depth is inferior to the threshold."
@@ -251,15 +277,17 @@
"metadata": {},
"outputs": [],
"source": [
"sample_1 = pd.DataFrame({\"Culmen Length (mm)\": [0], \"Culmen Depth (mm)\": [15]})\n",
"tree.predict(sample_1)"
"test_penguin_1 = pd.DataFrame(\n",
" {\"Culmen Length (mm)\": [0], \"Culmen Depth (mm)\": [15]}\n",
")\n",
"tree.predict(test_penguin_1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The class predicted is the Gentoo. We can now check what happens if we pass a\n",
"The class predicted is the \"Gentoo\". We can now check what happens if we pass a\n",
"culmen depth superior to the threshold."
]
},
@@ -269,17 +297,19 @@
"metadata": {},
"outputs": [],
"source": [
"sample_2 = pd.DataFrame({\"Culmen Length (mm)\": [0], \"Culmen Depth (mm)\": [17]})\n",
"tree.predict(sample_2)"
"test_penguin_2 = pd.DataFrame(\n",
" {\"Culmen Length (mm)\": [0], \"Culmen Depth (mm)\": [17]}\n",
")\n",
"tree.predict(test_penguin_2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, the tree predicts the Adelie specie.\n",
"In this case, the tree predicts the \"Adelie\" specie.\n",
"\n",
"Thus, we can conclude that a decision tree classifier will predict the most\n",
"Thus, we can conclude that a decision tree classifier predicts the most\n",
"represented class within a partition.\n",
"\n",
"During the training, we have a count of samples in each partition, we can also\n",
@@ -293,7 +323,7 @@
"metadata": {},
"outputs": [],
"source": [
"y_pred_proba = tree.predict_proba(sample_2)\n",
"y_pred_proba = tree.predict_proba(test_penguin_2)\n",
"y_proba_class_0 = pd.Series(y_pred_proba[0], index=tree.classes_)"
]
},
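These probabilities are simply the class counts of the reached partition divided by the number of training samples it contains. Using the counts quoted earlier for the partition with culmen depth above 16.45 mm (103 "Adelie", 52 "Chinstrap", 6 "Gentoo"), a quick check:

```python
# The predicted probabilities are the normalized class counts of the
# partition reached by the sample (counts quoted from the tree above).
counts = {"Adelie": 103, "Chinstrap": 52, "Gentoo": 6}
total = sum(counts.values())  # 161

for species, count in counts.items():
    print(f"{species}: {count}/{total} = {count / total:.3f}")
# Adelie: 103/161 = 0.640, Chinstrap: 52/161 = 0.323, Gentoo: 6/161 = 0.037
```

Up to rounding, these fractions should match the `predict_proba` output above, since a decision tree classifier reports the class frequencies of the leaf a sample falls into.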
@@ -338,8 +368,8 @@
"metadata": {},
"source": [
"It is also important to note that the culmen length has been disregarded for\n",
"the moment. It means that whatever the value given, it will not be used during\n",
"the prediction."
"the moment. It means that regardless of its value, it is not used during the\n",
"prediction."
]
},
{
@@ -348,10 +378,10 @@
"metadata": {},
"outputs": [],
"source": [
"sample_3 = pd.DataFrame(\n",
"test_penguin_3 = pd.DataFrame(\n",
" {\"Culmen Length (mm)\": [10_000], \"Culmen Depth (mm)\": [17]}\n",
")\n",
"tree.predict_proba(sample_3)"
"tree.predict_proba(test_penguin_3)"
]
},
{
@@ -378,12 +408,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Indeed, it is not a surprise. We saw earlier that a single feature will not be\n",
"able to separate all three species. However, from the previous analysis we saw\n",
"that by using both features we should be able to get fairly good results.\n",
"Indeed, it is not a surprise. We saw earlier that a single feature is not able\n",
"to separate all three species: it underfits. However, from the previous\n",
"analysis we saw that by using both features we should be able to get fairly\n",
"good results.\n",
"\n",
"In the next exercise, you will increase the size of the tree depth. You will\n",
"get intuitions on how the space partitioning is repeated over time."
"In the next exercise, you will increase the tree depth to get an intuition on\n",
"how such a parameter affects the space partitioning."
]
}
],
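As a preview of that exercise, here is a hedged sketch of refitting with a larger depth and scoring on held-out data; the `data_test`/`target_test` names carry over from the split sketch earlier and are assumptions, not part of this diff.

```python
# Hedged sketch: allow two levels of splits and evaluate on held-out data.
from sklearn.tree import DecisionTreeClassifier

deeper_tree = DecisionTreeClassifier(max_depth=2)
deeper_tree.fit(data_train, target_train)
print(f"Test accuracy: {deeper_tree.score(data_test, target_test):.3f}")
```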