diff --git a/_images/2aa9c74ca7d3918e85d1d490604c2013919666cbfcee89cd3577063008dd4dca.png b/_images/2aa9c74ca7d3918e85d1d490604c2013919666cbfcee89cd3577063008dd4dca.png new file mode 100644 index 000000000..87cc418b8 Binary files /dev/null and b/_images/2aa9c74ca7d3918e85d1d490604c2013919666cbfcee89cd3577063008dd4dca.png differ diff --git a/_images/3ae88c1adb5600e88dcd4d7f434880904a931162faafab94b4394049eadddaf8.png b/_images/3ae88c1adb5600e88dcd4d7f434880904a931162faafab94b4394049eadddaf8.png deleted file mode 100644 index 7ce8df53e..000000000 Binary files a/_images/3ae88c1adb5600e88dcd4d7f434880904a931162faafab94b4394049eadddaf8.png and /dev/null differ diff --git a/_images/51daaad6ea2e2328ec1d58bb641947bb7cff530b77da8d11f338d47492c5112a.png b/_images/51daaad6ea2e2328ec1d58bb641947bb7cff530b77da8d11f338d47492c5112a.png deleted file mode 100644 index 07c40c973..000000000 Binary files a/_images/51daaad6ea2e2328ec1d58bb641947bb7cff530b77da8d11f338d47492c5112a.png and /dev/null differ diff --git a/_images/5e87ed12d41bd11652e972f36d0fefe6c206f8e67d999fad5c5b55489ec19072.png b/_images/5e87ed12d41bd11652e972f36d0fefe6c206f8e67d999fad5c5b55489ec19072.png deleted file mode 100644 index 44be3f14a..000000000 Binary files a/_images/5e87ed12d41bd11652e972f36d0fefe6c206f8e67d999fad5c5b55489ec19072.png and /dev/null differ diff --git a/_images/5faeaf5a63f1f570ab9b6b920ba94bdcafd036237617f7879cba31bc695125f9.png b/_images/5faeaf5a63f1f570ab9b6b920ba94bdcafd036237617f7879cba31bc695125f9.png deleted file mode 100644 index b3a6df201..000000000 Binary files a/_images/5faeaf5a63f1f570ab9b6b920ba94bdcafd036237617f7879cba31bc695125f9.png and /dev/null differ diff --git a/_images/70aa7da15c40d36cab324eaa56215e10e389aa3b3456119a1391ca9aa3cc0f09.png b/_images/70aa7da15c40d36cab324eaa56215e10e389aa3b3456119a1391ca9aa3cc0f09.png deleted file mode 100644 index 35bf348db..000000000 Binary files a/_images/70aa7da15c40d36cab324eaa56215e10e389aa3b3456119a1391ca9aa3cc0f09.png and /dev/null differ diff --git a/_images/83c96f049a067da857eafe1958561bf17ff329b3c69e9f7f155c514f8e7ce764.png b/_images/83c96f049a067da857eafe1958561bf17ff329b3c69e9f7f155c514f8e7ce764.png deleted file mode 100644 index 2d5f55fa3..000000000 Binary files a/_images/83c96f049a067da857eafe1958561bf17ff329b3c69e9f7f155c514f8e7ce764.png and /dev/null differ diff --git a/_images/8c34c8c6a10f9dba07ff6399d462fc16fcf74a2e09b7e2880d34ed951cb7e5db.png b/_images/8c34c8c6a10f9dba07ff6399d462fc16fcf74a2e09b7e2880d34ed951cb7e5db.png deleted file mode 100644 index f59475922..000000000 Binary files a/_images/8c34c8c6a10f9dba07ff6399d462fc16fcf74a2e09b7e2880d34ed951cb7e5db.png and /dev/null differ diff --git a/_images/9dc7f8ded955ee64b8ccd8404f95aa3ebbf5cd6670d0de12fa606c80a0aa701a.png b/_images/9dc7f8ded955ee64b8ccd8404f95aa3ebbf5cd6670d0de12fa606c80a0aa701a.png deleted file mode 100644 index 7f7efb8c3..000000000 Binary files a/_images/9dc7f8ded955ee64b8ccd8404f95aa3ebbf5cd6670d0de12fa606c80a0aa701a.png and /dev/null differ diff --git a/_images/a62d184ac201155b1288e8196b70bc90ea08f89a015c1a806df7752664499a50.png b/_images/a62d184ac201155b1288e8196b70bc90ea08f89a015c1a806df7752664499a50.png deleted file mode 100644 index 068324f22..000000000 Binary files a/_images/a62d184ac201155b1288e8196b70bc90ea08f89a015c1a806df7752664499a50.png and /dev/null differ diff --git a/_images/ddc421df0dc5685d0cf6e79da2a0be5974065de8c76317cc2a8659c6d6cbcbff.png b/_images/ddc421df0dc5685d0cf6e79da2a0be5974065de8c76317cc2a8659c6d6cbcbff.png new file mode 100644 index 000000000..b02a13ea5 
Binary files /dev/null and b/_images/ddc421df0dc5685d0cf6e79da2a0be5974065de8c76317cc2a8659c6d6cbcbff.png differ diff --git a/_sources/linear_models/linear_models_classification_index.md b/_sources/linear_models/linear_models_classification_index.md deleted file mode 100644 index 81399c436..000000000 --- a/_sources/linear_models/linear_models_classification_index.md +++ /dev/null @@ -1,5 +0,0 @@ -# Linear model for classification - -```{tableofcontents} - -``` diff --git a/_sources/linear_models/linear_models_non_linear_index.md b/_sources/linear_models/linear_models_non_linear_index.md index d56614515..22fe06b20 100644 --- a/_sources/linear_models/linear_models_non_linear_index.md +++ b/_sources/linear_models/linear_models_non_linear_index.md @@ -1,4 +1,4 @@ -# Modelling non-linear features-target relationships +# Non-linear feature engineering for linear models ```{tableofcontents} diff --git a/_sources/linear_models/linear_models_regression_index.md b/_sources/linear_models/linear_models_regression_index.md deleted file mode 100644 index 8b8144a84..000000000 --- a/_sources/linear_models/linear_models_regression_index.md +++ /dev/null @@ -1,5 +0,0 @@ -# Linear regression - -```{tableofcontents} - -``` diff --git a/_sources/python_scripts/linear_models_ex_02.py b/_sources/python_scripts/linear_models_ex_02.py index 640c44046..f58a1f0fe 100644 --- a/_sources/python_scripts/linear_models_ex_02.py +++ b/_sources/python_scripts/linear_models_ex_02.py @@ -14,100 +14,80 @@ # %% [markdown] # # 📝 Exercise M4.02 # -# The goal of this exercise is to build an intuition on what will be the -# parameters' values of a linear model when the link between the data and the -# target is non-linear. +# In the previous notebook, we showed that we can add new features based on the +# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`. +# In that case we only used a single feature in `data`. # -# First, we will generate such non-linear data. +# The aim of this notebook is to train a linear regression algorithm on a +# dataset with more than a single feature. In such a "multi-dimensional" feature +# space we can derive new features of the form `x1 * x2`, `x2 * x3`, +# etc. Products of features are usually called "non-linear or +# multiplicative interactions" between features. # -# ```{tip} -# `np.random.RandomState` allows to create a random number generator which can -# be later used to get deterministic results. -# ``` - -# %% -import numpy as np - -# Set the seed for reproduction -rng = np.random.RandomState(0) - -# Generate data -n_sample = 100 -data_max, data_min = 1.4, -1.4 -len_data = data_max - data_min -data = rng.rand(n_sample) * len_data - len_data / 2 -noise = rng.randn(n_sample) * 0.3 -target = data**3 - 0.5 * data**2 + noise +# Feature engineering can be an important step of a model pipeline as long as +# the new features are expected to be predictive. For instance, think of a +# classification model to decide if a patient has risk of developing a heart +# disease. This would depend on the patient's Body Mass Index which is defined +# as `weight / height ** 2`. +# +# We load the dataset penguins dataset. We first use a set of 3 numerical +# features to predict the target, i.e. the body mass of the penguin. 
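An aside to illustrate the "multiplicative interactions" described above (this is not part of the exercise file): a minimal sketch of how scikit-learn's `PolynomialFeatures` with `degree=2` and `interaction_only=True` expands three columns into the original features plus their pairwise products. The toy matrix and feature names are invented for the example.

```python
# Minimal sketch: degree-2, interaction-only polynomial features keep the
# original columns and add the pairwise products x1*x2, x1*x3 and x2*x3.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # toy data: two samples, three features

interactions = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
)
print(interactions.fit_transform(X))
print(interactions.get_feature_names_out(["x1", "x2", "x3"]))
```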
# %% [markdown] # ```{note} -# To ease the plotting, we will create a Pandas dataframe containing the data -# and target +# If you want a deeper overview regarding this dataset, you can refer to the +# Appendix - Datasets description section at the end of this MOOC. # ``` # %% import pandas as pd -full_data = pd.DataFrame({"data": data, "target": target}) +penguins = pd.read_csv("../datasets/penguins.csv") -# %% -import seaborn as sns +columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"] +target_name = "Body Mass (g)" -_ = sns.scatterplot( - data=full_data, x="data", y="target", color="black", alpha=0.5 -) +# Remove lines with missing values for the columns of interest +penguins_non_missing = penguins[columns + [target_name]].dropna() -# %% [markdown] -# We observe that the link between the data `data` and vector `target` is -# non-linear. For instance, `data` could represent the years of experience -# (normalized) and `target` the salary (normalized). Therefore, the problem here -# would be to infer the salary given the years of experience. -# -# Using the function `f` defined below, find both the `weight` and the -# `intercept` that you think will lead to a good linear model. Plot both the -# data and the predictions of this model. - - -# %% -def f(data, weight=0, intercept=0): - target_predict = weight * data + intercept - return target_predict +data = penguins_non_missing[columns] +target = penguins_non_missing[target_name] +data.head() +# %% [markdown] +# Now it is your turn to train a linear regression model on this dataset. First, +# create a linear regression model. # %% # Write your code here. # %% [markdown] -# Compute the mean squared error for this model +# Execute a cross-validation with 10 folds and use the mean absolute error (MAE) +# as metric. # %% # Write your code here. # %% [markdown] -# Train a linear regression model on this dataset. -# -# ```{warning} -# In scikit-learn, by convention `data` (also called `X` in the scikit-learn -# documentation) should be a 2D matrix of shape `(n_samples, n_features)`. -# If `data` is a 1D vector, you need to reshape it into a matrix with a -# single column if the vector represents a feature or a single row if the -# vector represents a sample. -# ``` +# Compute the mean and std of the MAE in grams (g). # %% -from sklearn.linear_model import LinearRegression - # Write your code here. # %% [markdown] -# Compute predictions from the linear regression model and plot both the data -# and the predictions. +# Now create a pipeline using `make_pipeline` consisting of a +# `PolynomialFeatures` and a linear regression. Set `degree=2` and +# `interaction_only=True` to the feature engineering step. Remember not to +# include the bias to avoid redundancies with the linear's regression intercept. +# +# Use the same strategy as before to cross-validate such a pipeline. # %% # Write your code here. # %% [markdown] -# Compute the mean squared error +# Compute the mean and std of the MAE in grams (g) and compare with the results +# without feature engineering. # %% # Write your code here. 
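For reference, here is one possible sketch of the cross-validation this exercise asks for, assuming the penguins CSV sits at the path used in the loading cell above; the worked solution appears further down in this diff in `linear_models_sol_02.py`.

```python
# Sketch: 10-fold cross-validation of a plain LinearRegression, scored with
# the (negated) mean absolute error so the reported values are in grams.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

penguins = pd.read_csv("../datasets/penguins.csv")
columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Body Mass (g)"
penguins_non_missing = penguins[columns + [target_name]].dropna()
data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]

cv_results = cross_validate(
    LinearRegression(), data, target, cv=10, scoring="neg_mean_absolute_error"
)
mae = -cv_results["test_score"]
print(f"MAE without interactions: {mae.mean():.1f} ± {mae.std():.1f} g")
```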
diff --git a/_sources/python_scripts/linear_models_ex_03.py b/_sources/python_scripts/linear_models_ex_03.py index 3ab6949a3..9c311e817 100644 --- a/_sources/python_scripts/linear_models_ex_03.py +++ b/_sources/python_scripts/linear_models_ex_03.py @@ -14,24 +14,14 @@ # %% [markdown] # # 📝 Exercise M4.03 # -# In the previous notebook, we showed that we can add new features based on the -# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`. -# In that case we only used a single feature in `data`. +# The parameter `penalty` can control the **type** of regularization to use, +# whereas the regularization **strength** is set using the parameter `C`. +# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In +# this exercise, we ask you to train a logistic regression classifier using the +# `penalty="l2"` regularization (which happens to be the default in +# scikit-learn) to find by yourself the effect of the parameter `C`. # -# The aim of this notebook is to train a linear regression algorithm on a -# dataset with more than a single feature. In such a "multi-dimensional" feature -# space we can derive new features of the form `x1 * x2`, `x2 * x3`, -# etc. Products of features are usually called "non-linear or -# multiplicative interactions" between features. -# -# Feature engineering can be an important step of a model pipeline as long as -# the new features are expected to be predictive. For instance, think of a -# classification model to decide if a patient has risk of developing a heart -# disease. This would depend on the patient's Body Mass Index which is defined -# as `weight / height ** 2`. -# -# We load the dataset penguins dataset. We first use a set of 3 numerical -# features to predict the target, i.e. the body mass of the penguin. +# We start by loading the dataset. # %% [markdown] # ```{note} @@ -42,52 +32,51 @@ # %% import pandas as pd -penguins = pd.read_csv("../datasets/penguins.csv") +penguins = pd.read_csv("../datasets/penguins_classification.csv") +# only keep the Adelie and Chinstrap classes +penguins = ( + penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index() +) -columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"] -target_name = "Body Mass (g)" +culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"] +target_column = "Species" -# Remove lines with missing values for the columns of interest -penguins_non_missing = penguins[columns + [target_name]].dropna() +# %% +from sklearn.model_selection import train_test_split -data = penguins_non_missing[columns] -target = penguins_non_missing[target_name] -data.head() +penguins_train, penguins_test = train_test_split(penguins, random_state=0) -# %% [markdown] -# Now it is your turn to train a linear regression model on this dataset. First, -# create a linear regression model. +data_train = penguins_train[culmen_columns] +data_test = penguins_test[culmen_columns] -# %% -# Write your code here. +target_train = penguins_train[target_column] +target_test = penguins_test[target_column] # %% [markdown] -# Execute a cross-validation with 10 folds and use the mean absolute error (MAE) -# as metric. +# First, let's create our predictive model. # %% -# Write your code here. +from sklearn.pipeline import make_pipeline +from sklearn.preprocessing import StandardScaler +from sklearn.linear_model import LogisticRegression -# %% [markdown] -# Compute the mean and std of the MAE in grams (g). - -# %% -# Write your code here. 
+logistic_regression = make_pipeline( + StandardScaler(), LogisticRegression(penalty="l2") +) # %% [markdown] -# Now create a pipeline using `make_pipeline` consisting of a -# `PolynomialFeatures` and a linear regression. Set `degree=2` and -# `interaction_only=True` to the feature engineering step. Remember not to -# include the bias to avoid redundancies with the linear's regression intercept. -# -# Use the same strategy as before to cross-validate such a pipeline. +# Given the following candidates for the `C` parameter, find out the impact of +# `C` on the classifier decision boundary. You can use +# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the +# decision function boundary. # %% +Cs = [0.01, 0.1, 1, 10] + # Write your code here. # %% [markdown] -# Compute the mean and std of the MAE in grams (g) and compare with the results -# without feature engineering. +# Look at the impact of the `C` hyperparameter on the magnitude of the weights. # %% # Write your code here. diff --git a/_sources/python_scripts/linear_models_ex_04.py b/_sources/python_scripts/linear_models_ex_04.py deleted file mode 100644 index 18191bccf..000000000 --- a/_sources/python_scripts/linear_models_ex_04.py +++ /dev/null @@ -1,92 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.14.5 -# kernelspec: -# display_name: Python 3 -# name: python3 -# --- - -# %% [markdown] -# # 📝 Exercise M4.04 -# -# In the previous notebook, we saw the effect of applying some regularization on -# the coefficient of a linear model. -# -# In this exercise, we will study the advantage of using some regularization -# when dealing with correlated features. -# -# We will first create a regression dataset. This dataset will contain 2,000 -# samples and 5 features from which only 2 features will be informative. - -# %% -from sklearn.datasets import make_regression - -data, target, coef = make_regression( - n_samples=2_000, - n_features=5, - n_informative=2, - shuffle=False, - coef=True, - random_state=0, - noise=30, -) - -# %% [markdown] -# When creating the dataset, `make_regression` returns the true coefficient used -# to generate the dataset. Let's plot this information. - -# %% -import pandas as pd - -feature_names = [ - "Relevant feature #0", - "Relevant feature #1", - "Noisy feature #0", - "Noisy feature #1", - "Noisy feature #2", -] -coef = pd.Series(coef, index=feature_names) -coef.plot.barh() -coef - -# %% [markdown] -# Create a `LinearRegression` regressor and fit on the entire dataset and check -# the value of the coefficients. Are the coefficients of the linear regressor -# close to the coefficients used to generate the dataset? - -# %% -# Write your code here. - -# %% [markdown] -# Now, create a new dataset that will be the same as `data` with 4 additional -# columns that will repeat twice features 0 and 1. This procedure will create -# perfectly correlated features. - -# %% -# Write your code here. - -# %% [markdown] -# Fit again the linear regressor on this new dataset and check the coefficients. -# What do you observe? - -# %% -# Write your code here. - -# %% [markdown] -# Create a ridge regressor and fit on the same dataset. Check the coefficients. -# What do you observe? - -# %% -# Write your code here. - -# %% [markdown] -# Can you find the relationship between the ridge coefficients and the original -# coefficients? - -# %% -# Write your code here. 
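The deleted exercise above (M4.04) reasons about why regularization helps when features are perfectly correlated. Below is a minimal sketch of that idea, not a reconstruction of the removed notebook, using a made-up toy regression problem.

```python
# Minimal sketch: duplicating a column makes the ordinary least-squares
# solution non-unique (only the sum of the twin coefficients is determined),
# while a ridge penalty spreads the weight evenly across the duplicated
# columns and keeps the coefficients well behaved.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=500, n_features=2, noise=10, random_state=0)
X_dup = np.hstack([X, X[:, [0]]])  # append an exact copy of feature 0

print("OLS coefficients:  ", LinearRegression().fit(X_dup, y).coef_)
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X_dup, y).coef_)
```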
diff --git a/_sources/python_scripts/linear_models_ex_05.py b/_sources/python_scripts/linear_models_ex_05.py deleted file mode 100644 index 1c36b83c2..000000000 --- a/_sources/python_scripts/linear_models_ex_05.py +++ /dev/null @@ -1,83 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.14.5 -# kernelspec: -# display_name: Python 3 -# name: python3 -# --- - -# %% [markdown] -# # 📝 Exercise M4.05 -# -# In the previous notebook we set `penalty="none"` to disable regularization -# entirely. This parameter can also control the **type** of regularization to -# use, whereas the regularization **strength** is set using the parameter `C`. -# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In -# this exercise, we ask you to train a logistic regression classifier using the -# `penalty="l2"` regularization (which happens to be the default in -# scikit-learn) to find by yourself the effect of the parameter `C`. -# -# We will start by loading the dataset. - -# %% [markdown] -# ```{note} -# If you want a deeper overview regarding this dataset, you can refer to the -# Appendix - Datasets description section at the end of this MOOC. -# ``` - -# %% -import pandas as pd - -penguins = pd.read_csv("../datasets/penguins_classification.csv") -# only keep the Adelie and Chinstrap classes -penguins = ( - penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index() -) - -culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"] -target_column = "Species" - -# %% -from sklearn.model_selection import train_test_split - -penguins_train, penguins_test = train_test_split(penguins, random_state=0) - -data_train = penguins_train[culmen_columns] -data_test = penguins_test[culmen_columns] - -target_train = penguins_train[target_column] -target_test = penguins_test[target_column] - -# %% [markdown] -# First, let's create our predictive model. - -# %% -from sklearn.pipeline import make_pipeline -from sklearn.preprocessing import StandardScaler -from sklearn.linear_model import LogisticRegression - -logistic_regression = make_pipeline( - StandardScaler(), LogisticRegression(penalty="l2") -) - -# %% [markdown] -# Given the following candidates for the `C` parameter, find out the impact of -# `C` on the classifier decision boundary. You can use -# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the -# decision function boundary. - -# %% -Cs = [0.01, 0.1, 1, 10] - -# Write your code here. - -# %% [markdown] -# Look at the impact of the `C` hyperparameter on the magnitude of the weights. - -# %% -# Write your code here. diff --git a/_sources/python_scripts/linear_models_sol_02.py b/_sources/python_scripts/linear_models_sol_02.py index d62a4b983..3abc476da 100644 --- a/_sources/python_scripts/linear_models_sol_02.py +++ b/_sources/python_scripts/linear_models_sol_02.py @@ -8,123 +8,127 @@ # %% [markdown] # # 📃 Solution for Exercise M4.02 # -# The goal of this exercise is to build an intuition on what will be the -# parameters' values of a linear model when the link between the data and the -# target is non-linear. +# In the previous notebook, we showed that we can add new features based on the +# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`. +# In that case we only used a single feature in `data`. # -# First, we will generate such non-linear data. 
+# The aim of this notebook is to train a linear regression algorithm on a +# dataset with more than a single feature. In such a "multi-dimensional" feature +# space we can derive new features of the form `x1 * x2`, `x2 * x3`, +# etc. Products of features are usually called "non-linear or +# multiplicative interactions" between features. # -# ```{tip} -# `np.random.RandomState` allows to create a random number generator which can -# be later used to get deterministic results. -# ``` - -# %% -import numpy as np - -# Set the seed for reproduction -rng = np.random.RandomState(0) - -# Generate data -n_sample = 100 -data_max, data_min = 1.4, -1.4 -len_data = data_max - data_min -data = rng.rand(n_sample) * len_data - len_data / 2 -noise = rng.randn(n_sample) * 0.3 -target = data**3 - 0.5 * data**2 + noise +# Feature engineering can be an important step of a model pipeline as long as +# the new features are expected to be predictive. For instance, think of a +# classification model to decide if a patient has risk of developing a heart +# disease. This would depend on the patient's Body Mass Index which is defined +# as `weight / height ** 2`. +# +# We load the dataset penguins dataset. We first use a set of 3 numerical +# features to predict the target, i.e. the body mass of the penguin. # %% [markdown] # ```{note} -# To ease the plotting, we will create a Pandas dataframe containing the data -# and target +# If you want a deeper overview regarding this dataset, you can refer to the +# Appendix - Datasets description section at the end of this MOOC. # ``` # %% import pandas as pd -full_data = pd.DataFrame({"data": data, "target": target}) - -# %% -import seaborn as sns - -_ = sns.scatterplot( - data=full_data, x="data", y="target", color="black", alpha=0.5 -) +penguins = pd.read_csv("../datasets/penguins.csv") -# %% [markdown] -# We observe that the link between the data `data` and vector `target` is -# non-linear. For instance, `data` could represent the years of experience -# (normalized) and `target` the salary (normalized). Therefore, the problem here -# would be to infer the salary given the years of experience. -# -# Using the function `f` defined below, find both the `weight` and the -# `intercept` that you think will lead to a good linear model. Plot both the -# data and the predictions of this model. +columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"] +target_name = "Body Mass (g)" +# Remove lines with missing values for the columns of interest +penguins_non_missing = penguins[columns + [target_name]].dropna() -# %% -def f(data, weight=0, intercept=0): - target_predict = weight * data + intercept - return target_predict +data = penguins_non_missing[columns] +target = penguins_non_missing[target_name] +data.head() +# %% [markdown] +# Now it is your turn to train a linear regression model on this dataset. First, +# create a linear regression model. # %% # solution -predictions = f(data, weight=1.2, intercept=-0.2) +from sklearn.linear_model import LinearRegression -# %% tags=["solution"] -ax = sns.scatterplot( - data=full_data, x="data", y="target", color="black", alpha=0.5 -) -_ = ax.plot(data, predictions) +linear_regression = LinearRegression() # %% [markdown] -# Compute the mean squared error for this model +# Execute a cross-validation with 10 folds and use the mean absolute error (MAE) +# as metric. 
# %% # solution -from sklearn.metrics import mean_squared_error - -error = mean_squared_error(target, f(data, weight=1.2, intercept=-0.2)) -print(f"The MSE is {error}") +from sklearn.model_selection import cross_validate + +cv_results = cross_validate( + linear_regression, + data, + target, + cv=10, + scoring="neg_mean_absolute_error", + n_jobs=2, +) # %% [markdown] -# Train a linear regression model on this dataset. -# -# ```{warning} -# In scikit-learn, by convention `data` (also called `X` in the scikit-learn -# documentation) should be a 2D matrix of shape `(n_samples, n_features)`. -# If `data` is a 1D vector, you need to reshape it into a matrix with a -# single column if the vector represents a feature or a single row if the -# vector represents a sample. -# ``` +# Compute the mean and std of the MAE in grams (g). # %% -from sklearn.linear_model import LinearRegression - # solution -linear_regression = LinearRegression() -data_2d = data.reshape(-1, 1) -linear_regression.fit(data_2d, target) +print( + "Mean absolute error on testing set with original features: " + f"{-cv_results['test_score'].mean():.3f} ± " + f"{cv_results['test_score'].std():.3f} g" +) # %% [markdown] -# Compute predictions from the linear regression model and plot both the data -# and the predictions. +# Now create a pipeline using `make_pipeline` consisting of a +# `PolynomialFeatures` and a linear regression. Set `degree=2` and +# `interaction_only=True` to the feature engineering step. Remember not to +# include the bias to avoid redundancies with the linear's regression intercept. +# +# Use the same strategy as before to cross-validate such a pipeline. # %% # solution -predictions = linear_regression.predict(data_2d) +from sklearn.preprocessing import PolynomialFeatures +from sklearn.pipeline import make_pipeline -# %% tags=["solution"] -ax = sns.scatterplot( - data=full_data, x="data", y="target", color="black", alpha=0.5 +poly_features = PolynomialFeatures( + degree=2, include_bias=False, interaction_only=True +) +linear_regression_interactions = make_pipeline( + poly_features, linear_regression +) + +cv_results = cross_validate( + linear_regression_interactions, + data, + target, + cv=10, + scoring="neg_mean_absolute_error", + n_jobs=2, ) -_ = ax.plot(data, predictions) # %% [markdown] -# Compute the mean squared error +# Compute the mean and std of the MAE in grams (g) and compare with the results +# without feature engineering. # %% # solution -error = mean_squared_error(target, predictions) -print(f"The MSE is {error}") +print( + "Mean absolute error on testing set with interactions: " + f"{-cv_results['test_score'].mean():.3f} ± " + f"{cv_results['test_score'].std():.3f} g" +) + +# %% [markdown] tags=["solution"] +# We observe that the mean absolute error is lower and less spread with the +# enriched features. In this case the "interactions" are indeed predictive. In +# the following notebook we will see what happens when the enriched features are +# non-predictive and how to deal with this case. diff --git a/_sources/python_scripts/linear_models_sol_03.py b/_sources/python_scripts/linear_models_sol_03.py index 0cacfcf0d..d789c8522 100644 --- a/_sources/python_scripts/linear_models_sol_03.py +++ b/_sources/python_scripts/linear_models_sol_03.py @@ -8,24 +8,14 @@ # %% [markdown] # # 📃 Solution for Exercise M4.03 # -# In the previous notebook, we showed that we can add new features based on the -# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`. 
-# In that case we only used a single feature in `data`. +# The parameter `penalty` can control the **type** of regularization to use, +# whereas the regularization **strength** is set using the parameter `C`. +# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In +# this exercise, we ask you to train a logistic regression classifier using the +# `penalty="l2"` regularization (which happens to be the default in +# scikit-learn) to find by yourself the effect of the parameter `C`. # -# The aim of this notebook is to train a linear regression algorithm on a -# dataset with more than a single feature. In such a "multi-dimensional" feature -# space we can derive new features of the form `x1 * x2`, `x2 * x3`, -# etc. Products of features are usually called "non-linear or -# multiplicative interactions" between features. -# -# Feature engineering can be an important step of a model pipeline as long as -# the new features are expected to be predictive. For instance, think of a -# classification model to decide if a patient has risk of developing a heart -# disease. This would depend on the patient's Body Mass Index which is defined -# as `weight / height ** 2`. -# -# We load the dataset penguins dataset. We first use a set of 3 numerical -# features to predict the target, i.e. the body mass of the penguin. +# We start by loading the dataset. # %% [markdown] # ```{note} @@ -36,99 +26,97 @@ # %% import pandas as pd -penguins = pd.read_csv("../datasets/penguins.csv") - -columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"] -target_name = "Body Mass (g)" - -# Remove lines with missing values for the columns of interest -penguins_non_missing = penguins[columns + [target_name]].dropna() - -data = penguins_non_missing[columns] -target = penguins_non_missing[target_name] -data.head() +penguins = pd.read_csv("../datasets/penguins_classification.csv") +# only keep the Adelie and Chinstrap classes +penguins = ( + penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index() +) -# %% [markdown] -# Now it is your turn to train a linear regression model on this dataset. First, -# create a linear regression model. +culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"] +target_column = "Species" # %% -# solution -from sklearn.linear_model import LinearRegression +from sklearn.model_selection import train_test_split -linear_regression = LinearRegression() +penguins_train, penguins_test = train_test_split(penguins, random_state=0) -# %% [markdown] -# Execute a cross-validation with 10 folds and use the mean absolute error (MAE) -# as metric. +data_train = penguins_train[culmen_columns] +data_test = penguins_test[culmen_columns] -# %% -# solution -from sklearn.model_selection import cross_validate - -cv_results = cross_validate( - linear_regression, - data, - target, - cv=10, - scoring="neg_mean_absolute_error", - n_jobs=2, -) +target_train = penguins_train[target_column] +target_test = penguins_test[target_column] # %% [markdown] -# Compute the mean and std of the MAE in grams (g). +# First, let's create our predictive model. 
# %% -# solution -print( - "Mean absolute error on testing set with original features: " - f"{-cv_results['test_score'].mean():.3f} ± " - f"{cv_results['test_score'].std():.3f} g" +from sklearn.pipeline import make_pipeline +from sklearn.preprocessing import StandardScaler +from sklearn.linear_model import LogisticRegression + +logistic_regression = make_pipeline( + StandardScaler(), LogisticRegression(penalty="l2") ) # %% [markdown] -# Now create a pipeline using `make_pipeline` consisting of a -# `PolynomialFeatures` and a linear regression. Set `degree=2` and -# `interaction_only=True` to the feature engineering step. Remember not to -# include the bias to avoid redundancies with the linear's regression intercept. -# -# Use the same strategy as before to cross-validate such a pipeline. +# Given the following candidates for the `C` parameter, find out the impact of +# `C` on the classifier decision boundary. You can use +# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the +# decision function boundary. # %% -# solution -from sklearn.preprocessing import PolynomialFeatures -from sklearn.pipeline import make_pipeline - -poly_features = PolynomialFeatures( - degree=2, include_bias=False, interaction_only=True -) -linear_regression_interactions = make_pipeline( - poly_features, linear_regression -) +Cs = [0.01, 0.1, 1, 10] -cv_results = cross_validate( - linear_regression_interactions, - data, - target, - cv=10, - scoring="neg_mean_absolute_error", - n_jobs=2, -) +# solution +import matplotlib.pyplot as plt +import seaborn as sns +from sklearn.inspection import DecisionBoundaryDisplay + +for C in Cs: + logistic_regression.set_params(logisticregression__C=C) + logistic_regression.fit(data_train, target_train) + accuracy = logistic_regression.score(data_test, target_test) + + DecisionBoundaryDisplay.from_estimator( + logistic_regression, + data_test, + response_method="predict", + cmap="RdBu_r", + alpha=0.5, + ) + sns.scatterplot( + data=penguins_test, + x=culmen_columns[0], + y=culmen_columns[1], + hue=target_column, + palette=["tab:red", "tab:blue"], + ) + plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left") + plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}") # %% [markdown] -# Compute the mean and std of the MAE in grams (g) and compare with the results -# without feature engineering. +# Look at the impact of the `C` hyperparameter on the magnitude of the weights. # %% # solution -print( - "Mean absolute error on testing set with interactions: " - f"{-cv_results['test_score'].mean():.3f} ± " - f"{cv_results['test_score'].std():.3f} g" -) +weights_ridge = [] +for C in Cs: + logistic_regression.set_params(logisticregression__C=C) + logistic_regression.fit(data_train, target_train) + coefs = logistic_regression[-1].coef_[0] + weights_ridge.append(pd.Series(coefs, index=culmen_columns)) + +# %% tags=["solution"] +weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f"C: {C}" for C in Cs]) +weights_ridge.plot.barh() +_ = plt.title("LogisticRegression weights depending of C") # %% [markdown] tags=["solution"] -# We observe that the mean absolute error is lower and less spread with the -# enriched features. In this case the "interactions" are indeed predictive. In -# the following notebook we will see what happens when the enriched features are -# non-predictive and how to deal with this case. +# We see that a small `C` will shrink the weights values toward zero. It means +# that a small `C` provides a more regularized model. 
Thus, `C` is the inverse +# of the `alpha` coefficient in the `Ridge` model. +# +# Besides, with a strong penalty (i.e. small `C` value), the weight of the +# feature "Culmen Depth (mm)" is almost zero. It explains why the decision +# separation in the plot is almost perpendicular to the "Culmen Length (mm)" +# feature. diff --git a/_sources/python_scripts/linear_models_sol_04.py b/_sources/python_scripts/linear_models_sol_04.py deleted file mode 100644 index a759c3d24..000000000 --- a/_sources/python_scripts/linear_models_sol_04.py +++ /dev/null @@ -1,269 +0,0 @@ -# --- -# jupyter: -# kernelspec: -# display_name: Python 3 -# name: python3 -# --- - -# %% [markdown] -# # 📃 Solution for Exercise M4.04 -# -# In the previous notebook, we saw the effect of applying some regularization on -# the coefficient of a linear model. -# -# In this exercise, we will study the advantage of using some regularization -# when dealing with correlated features. -# -# We will first create a regression dataset. This dataset will contain 2,000 -# samples and 5 features from which only 2 features will be informative. - -# %% -from sklearn.datasets import make_regression - -data, target, coef = make_regression( - n_samples=2_000, - n_features=5, - n_informative=2, - shuffle=False, - coef=True, - random_state=0, - noise=30, -) - -# %% [markdown] -# When creating the dataset, `make_regression` returns the true coefficient used -# to generate the dataset. Let's plot this information. - -# %% -import pandas as pd - -feature_names = [ - "Relevant feature #0", - "Relevant feature #1", - "Noisy feature #0", - "Noisy feature #1", - "Noisy feature #2", -] -coef = pd.Series(coef, index=feature_names) -coef.plot.barh() -coef - -# %% [markdown] -# Create a `LinearRegression` regressor and fit on the entire dataset and check -# the value of the coefficients. Are the coefficients of the linear regressor -# close to the coefficients used to generate the dataset? - -# %% -# solution -from sklearn.linear_model import LinearRegression - -linear_regression = LinearRegression() -linear_regression.fit(data, target) -linear_regression.coef_ - -# %% tags=["solution"] -feature_names = [ - "Relevant feature #0", - "Relevant feature #1", - "Noisy feature #0", - "Noisy feature #1", - "Noisy feature #2", -] -coef = pd.Series(linear_regression.coef_, index=feature_names) -_ = coef.plot.barh() - -# %% [markdown] tags=["solution"] -# We see that the coefficients are close to the coefficients used to generate -# the dataset. The dispersion is indeed cause by the noise injected during the -# dataset generation. - -# %% [markdown] -# Now, create a new dataset that will be the same as `data` with 4 additional -# columns that will repeat twice features 0 and 1. This procedure will create -# perfectly correlated features. - -# %% -# solution -import numpy as np - -data = np.concatenate([data, data[:, [0, 1]], data[:, [0, 1]]], axis=1) - -# %% [markdown] -# Fit again the linear regressor on this new dataset and check the coefficients. -# What do you observe? 
- -# %% -# solution -linear_regression = LinearRegression() -linear_regression.fit(data, target) -linear_regression.coef_ - -# %% tags=["solution"] -feature_names = [ - "Relevant feature #0", - "Relevant feature #1", - "Noisy feature #0", - "Noisy feature #1", - "Noisy feature #2", - "First repetition of feature #0", - "First repetition of feature #1", - "Second repetition of feature #0", - "Second repetition of feature #1", -] -coef = pd.Series(linear_regression.coef_, index=feature_names) -_ = coef.plot.barh() - -# %% [markdown] tags=["solution"] -# We see that the coefficient values are far from what one could expect. By -# repeating the informative features, one would have expected these coefficients -# to be similarly informative. -# -# Instead, we see that some coefficients have a huge norm ~1e14. It indeed means -# that we try to solve an mathematical ill-posed problem. Indeed, finding -# coefficients in a linear regression involves inverting the matrix -# `np.dot(data.T, data)` which is not possible (or lead to high numerical -# errors). - -# %% [markdown] -# Create a ridge regressor and fit on the same dataset. Check the coefficients. -# What do you observe? - -# %% -# solution -from sklearn.linear_model import Ridge - -ridge = Ridge() -ridge.fit(data, target) -ridge.coef_ - -# %% tags=["solution"] -coef = pd.Series(ridge.coef_, index=feature_names) -_ = coef.plot.barh() - -# %% [markdown] tags=["solution"] -# We see that the penalty applied on the weights give a better results: the -# values of the coefficients do not suffer from numerical issues. Indeed, the -# matrix to be inverted internally is `np.dot(data.T, data) + alpha * I`. Adding -# this penalty `alpha` allow the inversion without numerical issue. - -# %% [markdown] -# Can you find the relationship between the ridge coefficients and the original -# coefficients? - -# %% -# solution -ridge.coef_[:5] * 3 - -# %% [markdown] tags=["solution"] -# Repeating three times each informative features induced to divide the ridge -# coefficients by three. - -# %% [markdown] tags=["solution"] -# ```{tip} -# We advise to always use a penalty to shrink the magnitude of the weights -# toward zero (also called "l2 penalty"). In scikit-learn, `LogisticRegression` -# applies such penalty by default. However, one needs to use `Ridge` (and even -# `RidgeCV` to tune the parameter `alpha`) instead of `LinearRegression`. -# -# Other kinds of regularizations exist but will not be covered in this course. -# ``` -# -# ## Dealing with correlation between one-hot encoded features -# -# In this section, we will focus on how to deal with correlated features that -# arise naturally when one-hot encoding categorical features. -# -# Let's first load the Ames housing dataset and take a subset of features that -# are only categorical features. - -# %% tags=["solution"] -import pandas as pd -from sklearn.model_selection import train_test_split - -ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?") -ames_housing = ames_housing.drop(columns="Id") - -categorical_columns = ["Street", "Foundation", "CentralAir", "PavedDrive"] -target_name = "SalePrice" -X, y = ames_housing[categorical_columns], ames_housing[target_name] - -X_train, X_test, y_train, y_test = train_test_split( - X, y, test_size=0.2, random_state=0 -) - -# %% [markdown] tags=["solution"] -# -# We previously presented that a `OneHotEncoder` creates as many columns as -# categories. Therefore, there is always one column (i.e. 
one encoded category) -# that can be inferred from the others. Thus, `OneHotEncoder` creates collinear -# features. -# -# We illustrate this behaviour by considering the "CentralAir" feature that -# contains only two categories: - -# %% tags=["solution"] -X_train["CentralAir"] - -# %% tags=["solution"] -from sklearn.preprocessing import OneHotEncoder - -single_feature = ["CentralAir"] -encoder = OneHotEncoder(sparse_output=False, dtype=np.int32) -X_trans = encoder.fit_transform(X_train[single_feature]) -X_trans = pd.DataFrame( - X_trans, - columns=encoder.get_feature_names_out(input_features=single_feature), -) -X_trans - -# %% [markdown] tags=["solution"] -# -# Here, we see that the encoded category "CentralAir_N" is the opposite of the -# encoded category "CentralAir_Y". Therefore, we observe that using a -# `OneHotEncoder` creates two features having the problematic pattern observed -# earlier in this exercise. Training a linear regression model on such a of -# one-hot encoded binary feature can therefore lead to numerical problems, -# especially without regularization. Furthermore, the two one-hot features are -# redundant as they encode exactly the same information in opposite ways. -# -# Using regularization helps to overcome the numerical issues that we -# highlighted earlier in this exercise. -# -# Another strategy is to arbitrarily drop one of the encoded categories. -# Scikit-learn provides such an option by setting the parameter `drop` in the -# `OneHotEncoder`. This parameter can be set to `first` to always drop the first -# encoded category or `binary_only` to only drop a column in the case of binary -# categories. - -# %% tags=["solution"] -encoder = OneHotEncoder(drop="first", sparse_output=False, dtype=np.int32) -X_trans = encoder.fit_transform(X_train[single_feature]) -X_trans = pd.DataFrame( - X_trans, - columns=encoder.get_feature_names_out(input_features=single_feature), -) -X_trans - -# %% [markdown] tags=["solution"] -# -# We see that only the second column of the previous encoded data is kept. -# Dropping one of the one-hot encoded column is a common practice, especially -# for binary categorical features. Note however that this breaks symmetry -# between categories and impacts the number of coefficients of the model, their -# values, and thus their meaning, especially when applying strong -# regularization. -# -# Let's finally illustrate how to use this option is a machine-learning -# pipeline: - -# %% tags=["solution"] -from sklearn.pipeline import make_pipeline - -model = make_pipeline(OneHotEncoder(drop="first", dtype=np.int32), Ridge()) -model.fit(X_train, y_train) -n_categories = [X_train[col].nunique() for col in X_train.columns] -print(f"R2 score on the testing set: {model.score(X_test, y_test):.2f}") -print( - f"Our model contains {model[-1].coef_.size} features while " - f"{sum(n_categories)} categories are originally available." -) diff --git a/_sources/python_scripts/linear_models_sol_05.py b/_sources/python_scripts/linear_models_sol_05.py deleted file mode 100644 index bc4a15df1..000000000 --- a/_sources/python_scripts/linear_models_sol_05.py +++ /dev/null @@ -1,123 +0,0 @@ -# --- -# jupyter: -# kernelspec: -# display_name: Python 3 -# name: python3 -# --- - -# %% [markdown] -# # 📃 Solution for Exercise M4.05 -# -# In the previous notebook we set `penalty="none"` to disable regularization -# entirely. This parameter can also control the **type** of regularization to -# use, whereas the regularization **strength** is set using the parameter `C`. 
-# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In -# this exercise, we ask you to train a logistic regression classifier using the -# `penalty="l2"` regularization (which happens to be the default in -# scikit-learn) to find by yourself the effect of the parameter `C`. -# -# We will start by loading the dataset. - -# %% [markdown] -# ```{note} -# If you want a deeper overview regarding this dataset, you can refer to the -# Appendix - Datasets description section at the end of this MOOC. -# ``` - -# %% -import pandas as pd - -penguins = pd.read_csv("../datasets/penguins_classification.csv") -# only keep the Adelie and Chinstrap classes -penguins = ( - penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index() -) - -culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"] -target_column = "Species" - -# %% -from sklearn.model_selection import train_test_split - -penguins_train, penguins_test = train_test_split(penguins, random_state=0) - -data_train = penguins_train[culmen_columns] -data_test = penguins_test[culmen_columns] - -target_train = penguins_train[target_column] -target_test = penguins_test[target_column] - -# %% [markdown] -# First, let's create our predictive model. - -# %% -from sklearn.pipeline import make_pipeline -from sklearn.preprocessing import StandardScaler -from sklearn.linear_model import LogisticRegression - -logistic_regression = make_pipeline( - StandardScaler(), LogisticRegression(penalty="l2") -) - -# %% [markdown] -# Given the following candidates for the `C` parameter, find out the impact of -# `C` on the classifier decision boundary. You can use -# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the -# decision function boundary. - -# %% -Cs = [0.01, 0.1, 1, 10] - -# solution -import matplotlib.pyplot as plt -import seaborn as sns -from sklearn.inspection import DecisionBoundaryDisplay - -for C in Cs: - logistic_regression.set_params(logisticregression__C=C) - logistic_regression.fit(data_train, target_train) - accuracy = logistic_regression.score(data_test, target_test) - - DecisionBoundaryDisplay.from_estimator( - logistic_regression, - data_test, - response_method="predict", - cmap="RdBu_r", - alpha=0.5, - ) - sns.scatterplot( - data=penguins_test, - x=culmen_columns[0], - y=culmen_columns[1], - hue=target_column, - palette=["tab:red", "tab:blue"], - ) - plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left") - plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}") - -# %% [markdown] -# Look at the impact of the `C` hyperparameter on the magnitude of the weights. - -# %% -# solution -weights_ridge = [] -for C in Cs: - logistic_regression.set_params(logisticregression__C=C) - logistic_regression.fit(data_train, target_train) - coefs = logistic_regression[-1].coef_[0] - weights_ridge.append(pd.Series(coefs, index=culmen_columns)) - -# %% tags=["solution"] -weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f"C: {C}" for C in Cs]) -weights_ridge.plot.barh() -_ = plt.title("LogisticRegression weights depending of C") - -# %% [markdown] tags=["solution"] -# We see that a small `C` will shrink the weights values toward zero. It means -# that a small `C` provides a more regularized model. Thus, `C` is the inverse -# of the `alpha` coefficient in the `Ridge` model. -# -# Besides, with a strong penalty (i.e. small `C` value), the weight of the -# feature "Culmen Depth (mm)" is almost zero. 
It explains why the decision -# separation in the plot is almost perpendicular to the "Culmen Length (mm)" -# feature. diff --git a/_sources/python_scripts/logistic_regression.py b/_sources/python_scripts/logistic_regression.py index 3156ebda0..45487341b 100644 --- a/_sources/python_scripts/logistic_regression.py +++ b/_sources/python_scripts/logistic_regression.py @@ -78,9 +78,7 @@ from sklearn.preprocessing import StandardScaler from sklearn.linear_model import LogisticRegression -logistic_regression = make_pipeline( - StandardScaler(), LogisticRegression(penalty=None) -) +logistic_regression = make_pipeline(StandardScaler(), LogisticRegression()) logistic_regression.fit(data_train, target_train) accuracy = logistic_regression.score(data_test, target_test) print(f"Accuracy on test set: {accuracy:.3f}") @@ -124,8 +122,7 @@ # %% [markdown] # Thus, we see that our decision function is represented by a line separating -# the 2 classes. We should also note that we did not impose any regularization -# by setting the parameter `penalty` to `'none'`. +# the 2 classes. # # Since the line is oblique, it means that we used a combination of both # features: diff --git a/appendix/acknowledgement.html b/appendix/acknowledgement.html index 82038225f..7cd243180 100644 --- a/appendix/acknowledgement.html +++ b/appendix/acknowledgement.html @@ -256,38 +256,28 @@
[Rendered HTML diff (sidebar table of contents in appendix/acknowledgement.html): the entries "Intuitions on linear models", "Linear regression" and "Modelling non-linear features-target relationships" are updated to match the renamed and removed source pages above; only this navigation text is recoverable from the generated markup.]