Commit
[ci skip] Rework ordering of linear models module (#701)
Co-authored-by: ArturoAmorQ <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
ogrisel and ArturoAmorQ committed Sep 1, 2023
1 parent e44a0f4 commit 999550c
Showing 219 changed files with 4,667 additions and 12,900 deletions.
Binary files not shown (9 files).
5 changes: 0 additions & 5 deletions _sources/linear_models/linear_models_classification_index.md

This file was deleted.

2 changes: 1 addition & 1 deletion _sources/linear_models/linear_models_non_linear_index.md
@@ -1,4 +1,4 @@
# Modelling non-linear features-target relationships
# Non-linear feature engineering for linear models

```{tableofcontents}
5 changes: 0 additions & 5 deletions _sources/linear_models/linear_models_regression_index.md

This file was deleted.

100 changes: 40 additions & 60 deletions _sources/python_scripts/linear_models_ex_02.py
@@ -14,100 +14,80 @@
# %% [markdown]
# # 📝 Exercise M4.02
#
# The goal of this exercise is to build an intuition on what will be the
# parameters' values of a linear model when the link between the data and the
# target is non-linear.
# In the previous notebook, we showed that we can add new features based on the
# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`.
# In that case we only used a single feature in `data`.
#
# First, we will generate such non-linear data.
# The aim of this notebook is to train a linear regression algorithm on a
# dataset with more than a single feature. In such a "multi-dimensional" feature
# space we can derive new features of the form `x1 * x2`, `x2 * x3`,
# etc. Products of features are usually called "non-linear" or
# "multiplicative" interactions between features.
#
# ```{tip}
# `np.random.RandomState` allows you to create a random number generator that
# can later be used to get deterministic results.
# ```

# %%
import numpy as np

# Set the seed for reproducibility
rng = np.random.RandomState(0)

# Generate data
n_sample = 100
data_max, data_min = 1.4, -1.4
len_data = data_max - data_min
data = rng.rand(n_sample) * len_data - len_data / 2
noise = rng.randn(n_sample) * 0.3
target = data**3 - 0.5 * data**2 + noise
# Feature engineering can be an important step of a model pipeline as long as
# the new features are expected to be predictive. For instance, think of a
# classification model to decide if a patient is at risk of developing a heart
# disease. This would depend on the patient's Body Mass Index, which is defined
# as `weight / height ** 2`.
#
# We load the penguins dataset. We first use a set of 3 numerical
# features to predict the target, i.e. the body mass of the penguin.

# %% [markdown]
# ```{note}
# To ease the plotting, we will create a Pandas dataframe containing the data
# and target.
# If you want a deeper overview regarding this dataset, you can refer to the
# Appendix - Datasets description section at the end of this MOOC.
# ```

# %%
import pandas as pd

full_data = pd.DataFrame({"data": data, "target": target})
penguins = pd.read_csv("../datasets/penguins.csv")

# %%
import seaborn as sns
columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Body Mass (g)"

_ = sns.scatterplot(
data=full_data, x="data", y="target", color="black", alpha=0.5
)
# Remove rows with missing values in the columns of interest
penguins_non_missing = penguins[columns + [target_name]].dropna()

# %% [markdown]
# We observe that the link between the vector `data` and the vector `target` is
# non-linear. For instance, `data` could represent the years of experience
# (normalized) and `target` the salary (normalized). Therefore, the problem here
# would be to infer the salary given the years of experience.
#
# Using the function `f` defined below, find both the `weight` and the
# `intercept` that you think will lead to a good linear model. Plot both the
# data and the predictions of this model.


# %%
def f(data, weight=0, intercept=0):
target_predict = weight * data + intercept
return target_predict
data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]
data.head()

# %% [markdown]
# Now it is your turn to train a linear regression model on this dataset. First,
# create a linear regression model.

# %%
# Write your code here.
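
For reference, a minimal sketch of one possible solution (the variable name `linear_regression` is an illustrative choice, not part of the exercise):

from sklearn.linear_model import LinearRegression

# Create an ordinary least squares model with default parameters.
linear_regression = LinearRegression()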

# %% [markdown]
# Compute the mean squared error for this model.
# Execute a cross-validation with 10 folds and use the mean absolute error (MAE)
# as the metric.

# %%
# Write your code here.
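
A possible sketch, assuming the `linear_regression`, `data` and `target` variables defined above; `scoring="neg_mean_absolute_error"` is scikit-learn's negated-MAE scorer:

from sklearn.model_selection import cross_validate

# 10-fold cross-validation scored with the (negated) mean absolute error.
cv_results = cross_validate(
    linear_regression,
    data,
    target,
    cv=10,
    scoring="neg_mean_absolute_error",
)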

# %% [markdown]
# Train a linear regression model on this dataset.
#
# ```{warning}
# In scikit-learn, by convention `data` (also called `X` in the scikit-learn
# documentation) should be a 2D matrix of shape `(n_samples, n_features)`.
# If `data` is a 1D vector, you need to reshape it into a matrix with a
# single column if the vector represents a feature or a single row if the
# vector represents a sample.
# ```
# Compute the mean and std of the MAE in grams (g).

# %%
from sklearn.linear_model import LinearRegression

# Write your code here.
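
One way to report the scores, assuming the hypothetical `cv_results` from the previous sketch (scikit-learn returns negated errors, hence the sign flip):

# Flip the sign to express the error as a positive MAE in grams.
mae_scores = -cv_results["test_score"]
print(f"MAE: {mae_scores.mean():.1f} g ± {mae_scores.std():.1f} g")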

# %% [markdown]
# Compute predictions from the linear regression model and plot both the data
# and the predictions.
# Now create a pipeline using `make_pipeline` consisting of a
# `PolynomialFeatures` step followed by a linear regression. Set `degree=2` and
# `interaction_only=True` on the feature engineering step. Remember not to
# include the bias to avoid redundancies with the linear regression's intercept.
#
# Use the same strategy as before to cross-validate such a pipeline.

# %%
# Write your code here.
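
A sketch of such a pipeline (names like `poly_regression` are illustrative; `include_bias=False` drops the constant column as requested):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Interaction terms only (x1 * x2, x1 * x3, ...), no bias column.
poly_regression = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)
cv_results_poly = cross_validate(
    poly_regression,
    data,
    target,
    cv=10,
    scoring="neg_mean_absolute_error",
)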

# %% [markdown]
# Compute the mean squared error
# Compute the mean and std of the MAE in grams (g) and compare with the results
# without feature engineering.

# %%
# Write your code here.
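
And the comparison, again assuming the hypothetical variables from the sketches above:

mae_poly = -cv_results_poly["test_score"]
print(f"MAE without interactions: {mae_scores.mean():.1f} ± {mae_scores.std():.1f} g")
print(f"MAE with interactions:    {mae_poly.mean():.1f} ± {mae_poly.std():.1f} g")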
81 changes: 35 additions & 46 deletions _sources/python_scripts/linear_models_ex_03.py
@@ -14,24 +14,14 @@
# %% [markdown]
# # 📝 Exercise M4.03
#
# In the previous notebook, we showed that we can add new features based on the
# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`.
# In that case we only used a single feature in `data`.
# The parameter `penalty` can control the **type** of regularization to use,
# whereas the regularization **strength** is set using the parameter `C`.
# Setting `penalty="none"` is equivalent to an infinitely large value of `C`. In
# this exercise, we ask you to train a logistic regression classifier using
# `penalty="l2"` regularization (which happens to be the default in
# scikit-learn) to find out by yourself the effect of the parameter `C`.
#
# The aim of this notebook is to train a linear regression algorithm on a
# dataset with more than a single feature. In such a "multi-dimensional" feature
# space we can derive new features of the form `x1 * x2`, `x2 * x3`,
# etc. Products of features are usually called "non-linear or
# multiplicative interactions" between features.
#
# Feature engineering can be an important step of a model pipeline as long as
# the new features are expected to be predictive. For instance, think of a
# classification model to decide if a patient has risk of developing a heart
# disease. This would depend on the patient's Body Mass Index which is defined
# as `weight / height ** 2`.
#
# We load the dataset penguins dataset. We first use a set of 3 numerical
# features to predict the target, i.e. the body mass of the penguin.
# We start by loading the dataset.

# %% [markdown]
# ```{note}
@@ -42,52 +32,51 @@
# %%
import pandas as pd

penguins = pd.read_csv("../datasets/penguins.csv")
penguins = pd.read_csv("../datasets/penguins_classification.csv")
# only keep the Adelie and Chinstrap classes
penguins = (
penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
)

columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Body Mass (g)"
culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"

# Remove lines with missing values for the columns of interest
penguins_non_missing = penguins[columns + [target_name]].dropna()
# %%
from sklearn.model_selection import train_test_split

data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]
data.head()
penguins_train, penguins_test = train_test_split(penguins, random_state=0)

# %% [markdown]
# Now it is your turn to train a linear regression model on this dataset. First,
# create a linear regression model.
data_train = penguins_train[culmen_columns]
data_test = penguins_test[culmen_columns]

# %%
# Write your code here.
target_train = penguins_train[target_column]
target_test = penguins_test[target_column]

# %% [markdown]
# Execute a cross-validation with 10 folds and use the mean absolute error (MAE)
# as the metric.
# First, let's create our predictive model.

# %%
# Write your code here.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# %% [markdown]
# Compute the mean and std of the MAE in grams (g).

# %%
# Write your code here.
logistic_regression = make_pipeline(
StandardScaler(), LogisticRegression(penalty="l2")
)

# %% [markdown]
# Now create a pipeline using `make_pipeline` consisting of a
# `PolynomialFeatures` and a linear regression. Set `degree=2` and
# `interaction_only=True` on the feature engineering step. Remember not to
# include the bias to avoid redundancies with the linear regression's intercept.
#
# Use the same strategy as before to cross-validate such a pipeline.
# Given the following candidates for the `C` parameter, find out the impact of
# `C` on the classifier's decision boundary. You can use
# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
# decision boundary.

# %%
Cs = [0.01, 0.1, 1, 10]

# Write your code here.
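
A possible sketch: loop over the candidate values, refit the pipeline and plot one decision boundary per value (the step name `logisticregression` follows `make_pipeline`'s automatic naming; plotting choices are illustrative):

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.inspection import DecisionBoundaryDisplay

for C in Cs:
    # Update the regularization strength and refit on the training set.
    logistic_regression.set_params(logisticregression__C=C)
    logistic_regression.fit(data_train, target_train)

    display = DecisionBoundaryDisplay.from_estimator(
        logistic_regression,
        data_train,
        response_method="predict",
        cmap="RdBu_r",
        alpha=0.5,
    )
    sns.scatterplot(
        data=penguins_train,
        x=culmen_columns[0],
        y=culmen_columns[1],
        hue=target_column,
        ax=display.ax_,
    )
    display.ax_.set_title(f"C = {C}")
plt.show()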

# %% [markdown]
# Compute the mean and std of the MAE in grams (g) and compare with the results
# without feature engineering.
# Look at the impact of the `C` hyperparameter on the magnitude of the weights.

# %%
# Write your code here.
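
One illustrative way to inspect the fitted weights for each candidate value, reusing the hypothetical loop above (`logistic_regression[-1]` accesses the `LogisticRegression` step of the pipeline):

import pandas as pd

weights = {}
for C in Cs:
    logistic_regression.set_params(logisticregression__C=C)
    logistic_regression.fit(data_train, target_train)
    # coef_ has shape (1, n_features) for this binary classification problem.
    weights[C] = pd.Series(
        logistic_regression[-1].coef_.ravel(), index=culmen_columns
    )

# One column per value of C, one row per feature.
print(pd.DataFrame(weights))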
92 changes: 0 additions & 92 deletions _sources/python_scripts/linear_models_ex_04.py

This file was deleted.

