diff --git a/jupyter-book/_toc.yml b/jupyter-book/_toc.yml
index dfc89c04f..80bb88aa3 100644
--- a/jupyter-book/_toc.yml
+++ b/jupyter-book/_toc.yml
@@ -91,34 +91,26 @@ parts:
sections:
- file: linear_models/linear_models_slides
- file: linear_models/linear_models_quiz_m4_01
- - file: linear_models/linear_models_regression_index
- sections:
- file: python_scripts/linear_regression_without_sklearn
- file: python_scripts/linear_models_ex_01
- file: python_scripts/linear_models_sol_01
- file: python_scripts/linear_regression_in_sklearn
+ - file: python_scripts/logistic_regression
- file: linear_models/linear_models_quiz_m4_02
- file: linear_models/linear_models_non_linear_index
sections:
+ - file: python_scripts/linear_regression_non_linear_link
- file: python_scripts/linear_models_ex_02
- file: python_scripts/linear_models_sol_02
- - file: python_scripts/linear_regression_non_linear_link
- - file: python_scripts/linear_models_ex_03
- - file: python_scripts/linear_models_sol_03
+ - file: python_scripts/logistic_regression_non_linear
- file: linear_models/linear_models_quiz_m4_03
- file: linear_models/linear_models_regularization_index
sections:
- file: linear_models/regularized_linear_models_slides
- file: python_scripts/linear_models_regularization
- - file: python_scripts/linear_models_ex_04
- - file: python_scripts/linear_models_sol_04
- file: linear_models/linear_models_quiz_m4_04
- - file: linear_models/linear_models_classification_index
- sections:
- - file: python_scripts/logistic_regression
- - file: python_scripts/linear_models_ex_05
- - file: python_scripts/linear_models_sol_05
- - file: python_scripts/logistic_regression_non_linear
+ - file: python_scripts/linear_models_ex_03
+ - file: python_scripts/linear_models_sol_03
- file: linear_models/linear_models_quiz_m4_05
- file: linear_models/linear_models_wrap_up_quiz
- file: linear_models/linear_models_module_take_away
diff --git a/jupyter-book/linear_models/linear_models_classification_index.md b/jupyter-book/linear_models/linear_models_classification_index.md
deleted file mode 100644
index 81399c436..000000000
--- a/jupyter-book/linear_models/linear_models_classification_index.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# Linear model for classification
-
-```{tableofcontents}
-
-```
diff --git a/jupyter-book/linear_models/linear_models_non_linear_index.md b/jupyter-book/linear_models/linear_models_non_linear_index.md
index d56614515..22fe06b20 100644
--- a/jupyter-book/linear_models/linear_models_non_linear_index.md
+++ b/jupyter-book/linear_models/linear_models_non_linear_index.md
@@ -1,4 +1,4 @@
-# Modelling non-linear features-target relationships
+# Non-linear feature engineering for linear models
```{tableofcontents}
diff --git a/jupyter-book/linear_models/linear_models_regression_index.md b/jupyter-book/linear_models/linear_models_regression_index.md
deleted file mode 100644
index 8b8144a84..000000000
--- a/jupyter-book/linear_models/linear_models_regression_index.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# Linear regression
-
-```{tableofcontents}
-
-```
diff --git a/notebooks/linear_models_ex_02.ipynb b/notebooks/linear_models_ex_02.ipynb
index c9c0aad96..4cf750e81 100644
--- a/notebooks/linear_models_ex_02.ipynb
+++ b/notebooks/linear_models_ex_02.ipynb
@@ -4,39 +4,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# \ud83d\udcdd Exercise M4.02\n",
+ "# \ud83d\udcdd Exercise M4.03\n",
"\n",
- "The goal of this exercise is to build an intuition on what will be the\n",
- "parameters' values of a linear model when the link between the data and the\n",
- "target is non-linear.\n",
+ "In all previous notebooks, we only used a single feature in `data`. But we\n",
+ "have already shown that we could add new features to make the model more\n",
+ "expressive by deriving new features, based on the original feature.\n",
"\n",
- "First, we will generate such non-linear data.\n",
+ "The aim of this notebook is to train a linear regression algorithm on a\n",
+ "dataset with more than a single feature.\n",
"\n",
- "<div class=\"admonition tip alert alert-warning\">\n",
- "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
- "<p class=\"last\">np.random.RandomState allows to create a random number generator which can\n",
- "be later used to get deterministic results.</p>\n",
- "</div>"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "\n",
- "# Set the seed for reproduction\n",
- "rng = np.random.RandomState(0)\n",
- "\n",
- "# Generate data\n",
- "n_sample = 100\n",
- "data_max, data_min = 1.4, -1.4\n",
- "len_data = data_max - data_min\n",
- "data = rng.rand(n_sample) * len_data - len_data / 2\n",
- "noise = rng.randn(n_sample) * 0.3\n",
- "target = data**3 - 0.5 * data**2 + noise"
+ "We will load a dataset about house prices in California. The dataset consists\n",
+ "of 8 features regarding the demography and geography of districts in\n",
+ "California and the aim is to predict the median house price of each district.\n",
+ "We will use all 8 features to predict the target, the median house price."
]
},
{
@@ -45,8 +25,8 @@
"source": [
"\n",
"<div class=\"admonition note alert alert-info\">\n",
"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
- "<p class=\"last\">To ease the plotting, we will create a Pandas dataframe containing the data\n",
- "and target</p>\n",
+ "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
+ "Appendix - Datasets description section at the end of this MOOC.</p>\n",
"</div>"
]
},
@@ -56,65 +36,19 @@
"metadata": {},
"outputs": [],
"source": [
- "import pandas as pd\n",
+ "from sklearn.datasets import fetch_california_housing\n",
"\n",
- "full_data = pd.DataFrame({\"data\": data, \"target\": target})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import seaborn as sns\n",
- "\n",
- "_ = sns.scatterplot(\n",
- " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "lines_to_next_cell": 2
- },
- "source": [
- "We observe that the link between the data `data` and vector `target` is\n",
- "non-linear. For instance, `data` could represent the years of experience\n",
- "(normalized) and `target` the salary (normalized). Therefore, the problem here\n",
- "would be to infer the salary given the years of experience.\n",
- "\n",
- "Using the function `f` defined below, find both the `weight` and the\n",
- "`intercept` that you think will lead to a good linear model. Plot both the\n",
- "data and the predictions of this model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "def f(data, weight=0, intercept=0):\n",
- " target_predict = weight * data + intercept\n",
- " return target_predict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
+ "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n",
+ "target *= 100 # rescale the target in k$\n",
+ "data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Compute the mean squared error for this model"
+ "Now it is your turn to train a linear regression model on this dataset. First,\n",
+ "create a linear regression model."
]
},
{
@@ -130,16 +64,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Train a linear regression model on this dataset.\n",
- "\n",
- "<div class=\"admonition warning alert alert-danger\">\n",
- "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Warning</p>\n",
- "<p class=\"last\">In scikit-learn, by convention data (also called X in the scikit-learn\n",
- "documentation) should be a 2D matrix of shape (n_samples, n_features).\n",
- "If data is a 1D vector, you need to reshape it into a matrix with a\n",
- "single column if the vector represents a feature or a single row if the\n",
- "vector represents a sample.</p>\n",
- "</div>"
+ "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n",
+ "as metric. Be sure to *return* the fitted *estimators*."
]
},
{
@@ -148,8 +74,6 @@
"metadata": {},
"outputs": [],
"source": [
- "from sklearn.linear_model import LinearRegression\n",
- "\n",
"# Write your code here."
]
},
@@ -157,8 +81,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Compute predictions from the linear regression model and plot both the data\n",
- "and the predictions."
+ "Compute the mean and std of the MAE in thousands of dollars (k$)."
]
},
{
@@ -172,9 +95,15 @@
},
{
"cell_type": "markdown",
- "metadata": {},
+ "metadata": {
+ "lines_to_next_cell": 2
+ },
"source": [
- "Compute the mean squared error"
+ "Inspect the fitted model using a box plot to show the distribution of values\n",
+ "for the coefficients returned from the cross-validation. Hint: use the\n",
+ "function\n",
+ "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n",
+ "to create a box plot."
]
},
{
diff --git a/notebooks/linear_models_ex_03.ipynb b/notebooks/linear_models_ex_03.ipynb
deleted file mode 100644
index 4cf750e81..000000000
--- a/notebooks/linear_models_ex_03.ipynb
+++ /dev/null
@@ -1,130 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# \ud83d\udcdd Exercise M4.03\n",
- "\n",
- "In all previous notebooks, we only used a single feature in `data`. But we\n",
- "have already shown that we could add new features to make the model more\n",
- "expressive by deriving new features, based on the original feature.\n",
- "\n",
- "The aim of this notebook is to train a linear regression algorithm on a\n",
- "dataset with more than a single feature.\n",
- "\n",
- "We will load a dataset about house prices in California. The dataset consists\n",
- "of 8 features regarding the demography and geography of districts in\n",
- "California and the aim is to predict the median house price of each district.\n",
- "We will use all 8 features to predict the target, the median house price."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "<div class=\"admonition note alert alert-info\">\n",
- "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
- "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
- "Appendix - Datasets description section at the end of this MOOC.</p>\n",
- "</div>"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.datasets import fetch_california_housing\n",
- "\n",
- "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n",
- "target *= 100 # rescale the target in k$\n",
- "data.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now it is your turn to train a linear regression model on this dataset. First,\n",
- "create a linear regression model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n",
- "as metric. Be sure to *return* the fitted *estimators*."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Compute the mean and std of the MAE in thousands of dollars (k$)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "lines_to_next_cell": 2
- },
- "source": [
- "Inspect the fitted model using a box plot to show the distribution of values\n",
- "for the coefficients returned from the cross-validation. Hint: use the\n",
- "function\n",
- "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n",
- "to create a box plot."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
- ]
- }
- ],
- "metadata": {
- "jupytext": {
- "main_language": "python"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
\ No newline at end of file
diff --git a/notebooks/linear_models_ex_04.ipynb b/notebooks/linear_models_ex_04.ipynb
deleted file mode 100644
index 77086778b..000000000
--- a/notebooks/linear_models_ex_04.ipynb
+++ /dev/null
@@ -1,165 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# \ud83d\udcdd Exercise M4.04\n",
- "\n",
- "In the previous notebook, we saw the effect of applying some regularization on\n",
- "the coefficient of a linear model.\n",
- "\n",
- "In this exercise, we will study the advantage of using some regularization\n",
- "when dealing with correlated features.\n",
- "\n",
- "We will first create a regression dataset. This dataset will contain 2,000\n",
- "samples and 5 features from which only 2 features will be informative."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.datasets import make_regression\n",
- "\n",
- "data, target, coef = make_regression(\n",
- " n_samples=2_000,\n",
- " n_features=5,\n",
- " n_informative=2,\n",
- " shuffle=False,\n",
- " coef=True,\n",
- " random_state=0,\n",
- " noise=30,\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "When creating the dataset, `make_regression` returns the true coefficient used\n",
- "to generate the dataset. Let's plot this information."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pandas as pd\n",
- "\n",
- "feature_names = [\n",
- " \"Relevant feature #0\",\n",
- " \"Relevant feature #1\",\n",
- " \"Noisy feature #0\",\n",
- " \"Noisy feature #1\",\n",
- " \"Noisy feature #2\",\n",
- "]\n",
- "coef = pd.Series(coef, index=feature_names)\n",
- "coef.plot.barh()\n",
- "coef"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Create a `LinearRegression` regressor and fit on the entire dataset and check\n",
- "the value of the coefficients. Are the coefficients of the linear regressor\n",
- "close to the coefficients used to generate the dataset?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now, create a new dataset that will be the same as `data` with 4 additional\n",
- "columns that will repeat twice features 0 and 1. This procedure will create\n",
- "perfectly correlated features."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Fit again the linear regressor on this new dataset and check the coefficients.\n",
- "What do you observe?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Create a ridge regressor and fit on the same dataset. Check the coefficients.\n",
- "What do you observe?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Can you find the relationship between the ridge coefficients and the original\n",
- "coefficients?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
- ]
- }
- ],
- "metadata": {
- "jupytext": {
- "main_language": "python"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
\ No newline at end of file
diff --git a/notebooks/linear_models_ex_05.ipynb b/notebooks/linear_models_ex_05.ipynb
deleted file mode 100644
index 866d52086..000000000
--- a/notebooks/linear_models_ex_05.ipynb
+++ /dev/null
@@ -1,137 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# \ud83d\udcdd Exercise M4.05\n",
- "\n",
- "In the previous notebook we set `penalty=\"none\"` to disable regularization\n",
- "entirely. This parameter can also control the **type** of regularization to\n",
- "use, whereas the regularization **strength** is set using the parameter `C`.\n",
- "Setting`penalty=\"none\"` is equivalent to an infinitely large value of `C`. In\n",
- "this exercise, we ask you to train a logistic regression classifier using the\n",
- "`penalty=\"l2\"` regularization (which happens to be the default in\n",
- "scikit-learn) to find by yourself the effect of the parameter `C`.\n",
- "\n",
- "We will start by loading the dataset."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "<div class=\"admonition note alert alert-info\">\n",
- "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
- "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
- "Appendix - Datasets description section at the end of this MOOC.</p>\n",
- "</div>"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pandas as pd\n",
- "\n",
- "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n",
- "# only keep the Adelie and Chinstrap classes\n",
- "penguins = (\n",
- " penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n",
- ")\n",
- "\n",
- "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n",
- "target_column = \"Species\""
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.model_selection import train_test_split\n",
- "\n",
- "penguins_train, penguins_test = train_test_split(penguins, random_state=0)\n",
- "\n",
- "data_train = penguins_train[culmen_columns]\n",
- "data_test = penguins_test[culmen_columns]\n",
- "\n",
- "target_train = penguins_train[target_column]\n",
- "target_test = penguins_test[target_column]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "First, let's create our predictive model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.pipeline import make_pipeline\n",
- "from sklearn.preprocessing import StandardScaler\n",
- "from sklearn.linear_model import LogisticRegression\n",
- "\n",
- "logistic_regression = make_pipeline(\n",
- " StandardScaler(), LogisticRegression(penalty=\"l2\")\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Given the following candidates for the `C` parameter, find out the impact of\n",
- "`C` on the classifier decision boundary. You can use\n",
- "`sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the\n",
- "decision function boundary."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "Cs = [0.01, 0.1, 1, 10]\n",
- "\n",
- "# Write your code here."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Look at the impact of the `C` hyperparameter on the magnitude of the weights."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
- ]
- }
- ],
- "metadata": {
- "jupytext": {
- "main_language": "python"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
\ No newline at end of file
diff --git a/notebooks/linear_models_sol_02.ipynb b/notebooks/linear_models_sol_02.ipynb
index d56864c4e..634c43171 100644
--- a/notebooks/linear_models_sol_02.ipynb
+++ b/notebooks/linear_models_sol_02.ipynb
@@ -4,39 +4,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# \ud83d\udcc3 Solution for Exercise M4.02\n",
+ "# \ud83d\udcc3 Solution for Exercise M4.03\n",
"\n",
- "The goal of this exercise is to build an intuition on what will be the\n",
- "parameters' values of a linear model when the link between the data and the\n",
- "target is non-linear.\n",
+ "In all previous notebooks, we only used a single feature in `data`. But we\n",
+ "have already shown that we could add new features to make the model more\n",
+ "expressive by deriving new features, based on the original feature.\n",
"\n",
- "First, we will generate such non-linear data.\n",
+ "The aim of this notebook is to train a linear regression algorithm on a\n",
+ "dataset with more than a single feature.\n",
"\n",
- "<div class=\"admonition tip alert alert-warning\">\n",
- "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
- "<p class=\"last\">np.random.RandomState allows to create a random number generator which can\n",
- "be later used to get deterministic results.</p>\n",
- "</div>"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "\n",
- "# Set the seed for reproduction\n",
- "rng = np.random.RandomState(0)\n",
- "\n",
- "# Generate data\n",
- "n_sample = 100\n",
- "data_max, data_min = 1.4, -1.4\n",
- "len_data = data_max - data_min\n",
- "data = rng.rand(n_sample) * len_data - len_data / 2\n",
- "noise = rng.randn(n_sample) * 0.3\n",
- "target = data**3 - 0.5 * data**2 + noise"
+ "We will load a dataset about house prices in California. The dataset consists\n",
+ "of 8 features regarding the demography and geography of districts in\n",
+ "California and the aim is to predict the median house price of each district.\n",
+ "We will use all 8 features to predict the target, the median house price."
]
},
{
@@ -45,8 +25,8 @@
"source": [
"\n",
"<div class=\"admonition note alert alert-info\">\n",
"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
- "<p class=\"last\">To ease the plotting, we will create a Pandas dataframe containing the data\n",
- "and target</p>\n",
+ "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
+ "Appendix - Datasets description section at the end of this MOOC.</p>\n",
"</div>"
]
},
@@ -56,49 +36,19 @@
"metadata": {},
"outputs": [],
"source": [
- "import pandas as pd\n",
- "\n",
- "full_data = pd.DataFrame({\"data\": data, \"target\": target})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import seaborn as sns\n",
+ "from sklearn.datasets import fetch_california_housing\n",
"\n",
- "_ = sns.scatterplot(\n",
- " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n",
- ")"
+ "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n",
+ "target *= 100 # rescale the target in k$\n",
+ "data.head()"
]
},
{
"cell_type": "markdown",
- "metadata": {
- "lines_to_next_cell": 2
- },
- "source": [
- "We observe that the link between the data `data` and vector `target` is\n",
- "non-linear. For instance, `data` could represent the years of experience\n",
- "(normalized) and `target` the salary (normalized). Therefore, the problem here\n",
- "would be to infer the salary given the years of experience.\n",
- "\n",
- "Using the function `f` defined below, find both the `weight` and the\n",
- "`intercept` that you think will lead to a good linear model. Plot both the\n",
- "data and the predictions of this model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
"metadata": {},
- "outputs": [],
"source": [
- "def f(data, weight=0, intercept=0):\n",
- " target_predict = weight * data + intercept\n",
- " return target_predict"
+ "Now it is your turn to train a linear regression model on this dataset. First,\n",
+ "create a linear regression model."
]
},
{
@@ -108,30 +58,17 @@
"outputs": [],
"source": [
"# solution\n",
- "predictions = f(data, weight=1.2, intercept=-0.2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "outputs": [],
- "source": [
- "ax = sns.scatterplot(\n",
- " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n",
- ")\n",
- "_ = ax.plot(data, predictions)"
+ "from sklearn.linear_model import LinearRegression\n",
+ "\n",
+ "linear_regression = LinearRegression()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Compute the mean squared error for this model"
+ "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n",
+ "as metric. Be sure to *return* the fitted *estimators*."
]
},
{
@@ -141,26 +78,24 @@
"outputs": [],
"source": [
"# solution\n",
- "from sklearn.metrics import mean_squared_error\n",
+ "from sklearn.model_selection import cross_validate\n",
"\n",
- "error = mean_squared_error(target, f(data, weight=1.2, intercept=-0.2))\n",
- "print(f\"The MSE is {error}\")"
+ "cv_results = cross_validate(\n",
+ " linear_regression,\n",
+ " data,\n",
+ " target,\n",
+ " scoring=\"neg_mean_absolute_error\",\n",
+ " return_estimator=True,\n",
+ " cv=10,\n",
+ " n_jobs=2,\n",
+ ")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Train a linear regression model on this dataset.\n",
- "\n",
- "<div class=\"admonition warning alert alert-danger\">\n",
- "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Warning</p>\n",
- "<p class=\"last\">In scikit-learn, by convention data (also called X in the scikit-learn\n",
- "documentation) should be a 2D matrix of shape (n_samples, n_features).\n",
- "If data is a 1D vector, you need to reshape it into a matrix with a\n",
- "single column if the vector represents a feature or a single row if the\n",
- "vector represents a sample.</p>\n",
- "</div>"
+ "Compute the mean and std of the MAE in thousands of dollars (k$)."
]
},
{
@@ -169,20 +104,25 @@
"metadata": {},
"outputs": [],
"source": [
- "from sklearn.linear_model import LinearRegression\n",
- "\n",
"# solution\n",
- "linear_regression = LinearRegression()\n",
- "data_2d = data.reshape(-1, 1)\n",
- "linear_regression.fit(data_2d, target)"
+ "print(\n",
+ " \"Mean absolute error on testing set: \"\n",
+ " f\"{-cv_results['test_score'].mean():.3f} k$ \u00b1 \"\n",
+ " f\"{cv_results['test_score'].std():.3f}\"\n",
+ ")"
]
},
{
"cell_type": "markdown",
- "metadata": {},
+ "metadata": {
+ "lines_to_next_cell": 2
+ },
"source": [
- "Compute predictions from the linear regression model and plot both the data\n",
- "and the predictions."
+ "Inspect the fitted model using a box plot to show the distribution of values\n",
+ "for the coefficients returned from the cross-validation. Hint: use the\n",
+ "function\n",
+ "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n",
+ "to create a box plot."
]
},
{
@@ -192,7 +132,11 @@
"outputs": [],
"source": [
"# solution\n",
- "predictions = linear_regression.predict(data_2d)"
+ "import pandas as pd\n",
+ "\n",
+ "weights = pd.DataFrame(\n",
+ " [est.coef_ for est in cv_results[\"estimator\"]], columns=data.columns\n",
+ ")"
]
},
{
@@ -205,28 +149,11 @@
},
"outputs": [],
"source": [
- "ax = sns.scatterplot(\n",
- " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n",
- ")\n",
- "_ = ax.plot(data, predictions)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Compute the mean squared error"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "error = mean_squared_error(target, predictions)\n",
- "print(f\"The MSE is {error}\")"
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n",
+ "weights.plot.box(color=color, vert=False)\n",
+ "_ = plt.title(\"Value of linear regression coefficients\")"
]
}
],
diff --git a/notebooks/linear_models_sol_03.ipynb b/notebooks/linear_models_sol_03.ipynb
deleted file mode 100644
index 634c43171..000000000
--- a/notebooks/linear_models_sol_03.ipynb
+++ /dev/null
@@ -1,171 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# \ud83d\udcc3 Solution for Exercise M4.03\n",
- "\n",
- "In all previous notebooks, we only used a single feature in `data`. But we\n",
- "have already shown that we could add new features to make the model more\n",
- "expressive by deriving new features, based on the original feature.\n",
- "\n",
- "The aim of this notebook is to train a linear regression algorithm on a\n",
- "dataset with more than a single feature.\n",
- "\n",
- "We will load a dataset about house prices in California. The dataset consists\n",
- "of 8 features regarding the demography and geography of districts in\n",
- "California and the aim is to predict the median house price of each district.\n",
- "We will use all 8 features to predict the target, the median house price."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "<div class=\"admonition note alert alert-info\">\n",
- "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
- "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
- "Appendix - Datasets description section at the end of this MOOC.</p>\n",
- "</div>"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.datasets import fetch_california_housing\n",
- "\n",
- "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n",
- "target *= 100 # rescale the target in k$\n",
- "data.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now it is your turn to train a linear regression model on this dataset. First,\n",
- "create a linear regression model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "from sklearn.linear_model import LinearRegression\n",
- "\n",
- "linear_regression = LinearRegression()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n",
- "as metric. Be sure to *return* the fitted *estimators*."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "from sklearn.model_selection import cross_validate\n",
- "\n",
- "cv_results = cross_validate(\n",
- " linear_regression,\n",
- " data,\n",
- " target,\n",
- " scoring=\"neg_mean_absolute_error\",\n",
- " return_estimator=True,\n",
- " cv=10,\n",
- " n_jobs=2,\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Compute the mean and std of the MAE in thousands of dollars (k$)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "print(\n",
- " \"Mean absolute error on testing set: \"\n",
- " f\"{-cv_results['test_score'].mean():.3f} k$ \u00b1 \"\n",
- " f\"{cv_results['test_score'].std():.3f}\"\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "lines_to_next_cell": 2
- },
- "source": [
- "Inspect the fitted model using a box plot to show the distribution of values\n",
- "for the coefficients returned from the cross-validation. Hint: use the\n",
- "function\n",
- "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n",
- "to create a box plot."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "import pandas as pd\n",
- "\n",
- "weights = pd.DataFrame(\n",
- " [est.coef_ for est in cv_results[\"estimator\"]], columns=data.columns\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "outputs": [],
- "source": [
- "import matplotlib.pyplot as plt\n",
- "\n",
- "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n",
- "weights.plot.box(color=color, vert=False)\n",
- "_ = plt.title(\"Value of linear regression coefficients\")"
- ]
- }
- ],
- "metadata": {
- "jupytext": {
- "main_language": "python"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
\ No newline at end of file
diff --git a/notebooks/linear_models_sol_04.ipynb b/notebooks/linear_models_sol_04.ipynb
deleted file mode 100644
index f49b0c465..000000000
--- a/notebooks/linear_models_sol_04.ipynb
+++ /dev/null
@@ -1,492 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# \ud83d\udcc3 Solution for Exercise M4.04\n",
- "\n",
- "In the previous notebook, we saw the effect of applying some regularization on\n",
- "the coefficient of a linear model.\n",
- "\n",
- "In this exercise, we will study the advantage of using some regularization\n",
- "when dealing with correlated features.\n",
- "\n",
- "We will first create a regression dataset. This dataset will contain 2,000\n",
- "samples and 5 features from which only 2 features will be informative."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.datasets import make_regression\n",
- "\n",
- "data, target, coef = make_regression(\n",
- " n_samples=2_000,\n",
- " n_features=5,\n",
- " n_informative=2,\n",
- " shuffle=False,\n",
- " coef=True,\n",
- " random_state=0,\n",
- " noise=30,\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "When creating the dataset, `make_regression` returns the true coefficient used\n",
- "to generate the dataset. Let's plot this information."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pandas as pd\n",
- "\n",
- "feature_names = [\n",
- " \"Relevant feature #0\",\n",
- " \"Relevant feature #1\",\n",
- " \"Noisy feature #0\",\n",
- " \"Noisy feature #1\",\n",
- " \"Noisy feature #2\",\n",
- "]\n",
- "coef = pd.Series(coef, index=feature_names)\n",
- "coef.plot.barh()\n",
- "coef"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Create a `LinearRegression` regressor and fit on the entire dataset and check\n",
- "the value of the coefficients. Are the coefficients of the linear regressor\n",
- "close to the coefficients used to generate the dataset?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "from sklearn.linear_model import LinearRegression\n",
- "\n",
- "linear_regression = LinearRegression()\n",
- "linear_regression.fit(data, target)\n",
- "linear_regression.coef_"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "outputs": [],
- "source": [
- "feature_names = [\n",
- " \"Relevant feature #0\",\n",
- " \"Relevant feature #1\",\n",
- " \"Noisy feature #0\",\n",
- " \"Noisy feature #1\",\n",
- " \"Noisy feature #2\",\n",
- "]\n",
- "coef = pd.Series(linear_regression.coef_, index=feature_names)\n",
- "_ = coef.plot.barh()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "source": [
- "We see that the coefficients are close to the coefficients used to generate\n",
- "the dataset. The dispersion is indeed cause by the noise injected during the\n",
- "dataset generation."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now, create a new dataset that will be the same as `data` with 4 additional\n",
- "columns that will repeat twice features 0 and 1. This procedure will create\n",
- "perfectly correlated features."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "import numpy as np\n",
- "\n",
- "data = np.concatenate([data, data[:, [0, 1]], data[:, [0, 1]]], axis=1)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Fit again the linear regressor on this new dataset and check the coefficients.\n",
- "What do you observe?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "linear_regression = LinearRegression()\n",
- "linear_regression.fit(data, target)\n",
- "linear_regression.coef_"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "outputs": [],
- "source": [
- "feature_names = [\n",
- " \"Relevant feature #0\",\n",
- " \"Relevant feature #1\",\n",
- " \"Noisy feature #0\",\n",
- " \"Noisy feature #1\",\n",
- " \"Noisy feature #2\",\n",
- " \"First repetition of feature #0\",\n",
- " \"First repetition of feature #1\",\n",
- " \"Second repetition of feature #0\",\n",
- " \"Second repetition of feature #1\",\n",
- "]\n",
- "coef = pd.Series(linear_regression.coef_, index=feature_names)\n",
- "_ = coef.plot.barh()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "source": [
- "We see that the coefficient values are far from what one could expect. By\n",
- "repeating the informative features, one would have expected these coefficients\n",
- "to be similarly informative.\n",
- "\n",
- "Instead, we see that some coefficients have a huge norm ~1e14. It indeed means\n",
- "that we try to solve an mathematical ill-posed problem. Indeed, finding\n",
- "coefficients in a linear regression involves inverting the matrix\n",
- "`np.dot(data.T, data)` which is not possible (or lead to high numerical\n",
- "errors)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Create a ridge regressor and fit on the same dataset. Check the coefficients.\n",
- "What do you observe?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "from sklearn.linear_model import Ridge\n",
- "\n",
- "ridge = Ridge()\n",
- "ridge.fit(data, target)\n",
- "ridge.coef_"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "outputs": [],
- "source": [
- "coef = pd.Series(ridge.coef_, index=feature_names)\n",
- "_ = coef.plot.barh()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "source": [
- "We see that the penalty applied on the weights give a better results: the\n",
- "values of the coefficients do not suffer from numerical issues. Indeed, the\n",
- "matrix to be inverted internally is `np.dot(data.T, data) + alpha * I`. Adding\n",
- "this penalty `alpha` allow the inversion without numerical issue."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Can you find the relationship between the ridge coefficients and the original\n",
- "coefficients?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "ridge.coef_[:5] * 3"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "source": [
- "Repeating three times each informative features induced to divide the ridge\n",
- "coefficients by three."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "source": [
- "\n",
- "<div class=\"admonition tip alert alert-warning\">\n",
- "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
- "<p>We advise to always use a penalty to shrink the magnitude of the weights\n",
- "toward zero (also called \"l2 penalty\"). In scikit-learn, LogisticRegression\n",
- "applies such penalty by default. However, one needs to use Ridge (and even\n",
- "RidgeCV to tune the parameter alpha) instead of LinearRegression.</p>\n",
- "<p class=\"last\">Other kinds of regularizations exist but will not be covered in this course.</p>\n",
- "</div>\n",
- "\n",
- "## Dealing with correlation between one-hot encoded features\n",
- "\n",
- "In this section, we will focus on how to deal with correlated features that\n",
- "arise naturally when one-hot encoding categorical features.\n",
- "\n",
- "Let's first load the Ames housing dataset and take a subset of features that\n",
- "are only categorical features."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "outputs": [],
- "source": [
- "import pandas as pd\n",
- "from sklearn.model_selection import train_test_split\n",
- "\n",
- "ames_housing = pd.read_csv(\"../datasets/house_prices.csv\", na_values=\"?\")\n",
- "ames_housing = ames_housing.drop(columns=\"Id\")\n",
- "\n",
- "categorical_columns = [\"Street\", \"Foundation\", \"CentralAir\", \"PavedDrive\"]\n",
- "target_name = \"SalePrice\"\n",
- "X, y = ames_housing[categorical_columns], ames_housing[target_name]\n",
- "\n",
- "X_train, X_test, y_train, y_test = train_test_split(\n",
- " X, y, test_size=0.2, random_state=0\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "source": [
- "\n",
- "We previously presented that a `OneHotEncoder` creates as many columns as\n",
- "categories. Therefore, there is always one column (i.e. one encoded category)\n",
- "that can be inferred from the others. Thus, `OneHotEncoder` creates collinear\n",
- "features.\n",
- "\n",
- "We illustrate this behaviour by considering the \"CentralAir\" feature that\n",
- "contains only two categories:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "outputs": [],
- "source": [
- "X_train[\"CentralAir\"]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "outputs": [],
- "source": [
- "from sklearn.preprocessing import OneHotEncoder\n",
- "\n",
- "single_feature = [\"CentralAir\"]\n",
- "encoder = OneHotEncoder(sparse_output=False, dtype=np.int32)\n",
- "X_trans = encoder.fit_transform(X_train[single_feature])\n",
- "X_trans = pd.DataFrame(\n",
- " X_trans,\n",
- " columns=encoder.get_feature_names_out(input_features=single_feature),\n",
- ")\n",
- "X_trans"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "source": [
- "\n",
- "Here, we see that the encoded category \"CentralAir_N\" is the opposite of the\n",
- "encoded category \"CentralAir_Y\". Therefore, we observe that using a\n",
- "`OneHotEncoder` creates two features having the problematic pattern observed\n",
- "earlier in this exercise. Training a linear regression model on such a of\n",
- "one-hot encoded binary feature can therefore lead to numerical problems,\n",
- "especially without regularization. Furthermore, the two one-hot features are\n",
- "redundant as they encode exactly the same information in opposite ways.\n",
- "\n",
- "Using regularization helps to overcome the numerical issues that we\n",
- "highlighted earlier in this exercise.\n",
- "\n",
- "Another strategy is to arbitrarily drop one of the encoded categories.\n",
- "Scikit-learn provides such an option by setting the parameter `drop` in the\n",
- "`OneHotEncoder`. This parameter can be set to `first` to always drop the first\n",
- "encoded category or `binary_only` to only drop a column in the case of binary\n",
- "categories."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "outputs": [],
- "source": [
- "encoder = OneHotEncoder(drop=\"first\", sparse_output=False, dtype=np.int32)\n",
- "X_trans = encoder.fit_transform(X_train[single_feature])\n",
- "X_trans = pd.DataFrame(\n",
- " X_trans,\n",
- " columns=encoder.get_feature_names_out(input_features=single_feature),\n",
- ")\n",
- "X_trans"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "source": [
- "\n",
- "We see that only the second column of the previous encoded data is kept.\n",
- "Dropping one of the one-hot encoded column is a common practice, especially\n",
- "for binary categorical features. Note however that this breaks symmetry\n",
- "between categories and impacts the number of coefficients of the model, their\n",
- "values, and thus their meaning, especially when applying strong\n",
- "regularization.\n",
- "\n",
- "Let's finally illustrate how to use this option is a machine-learning\n",
- "pipeline:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "outputs": [],
- "source": [
- "from sklearn.pipeline import make_pipeline\n",
- "\n",
- "model = make_pipeline(OneHotEncoder(drop=\"first\", dtype=np.int32), Ridge())\n",
- "model.fit(X_train, y_train)\n",
- "n_categories = [X_train[col].nunique() for col in X_train.columns]\n",
- "print(f\"R2 score on the testing set: {model.score(X_test, y_test):.2f}\")\n",
- "print(\n",
- " f\"Our model contains {model[-1].coef_.size} features while \"\n",
- " f\"{sum(n_categories)} categories are originally available.\"\n",
- ")"
- ]
- }
- ],
- "metadata": {
- "jupytext": {
- "main_language": "python"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
\ No newline at end of file
diff --git a/notebooks/linear_models_sol_05.ipynb b/notebooks/linear_models_sol_05.ipynb
deleted file mode 100644
index 08bae2e77..000000000
--- a/notebooks/linear_models_sol_05.ipynb
+++ /dev/null
@@ -1,201 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# \ud83d\udcc3 Solution for Exercise M4.05\n",
- "\n",
- "In the previous notebook we set `penalty=\"none\"` to disable regularization\n",
- "entirely. This parameter can also control the **type** of regularization to\n",
- "use, whereas the regularization **strength** is set using the parameter `C`.\n",
- "Setting`penalty=\"none\"` is equivalent to an infinitely large value of `C`. In\n",
- "this exercise, we ask you to train a logistic regression classifier using the\n",
- "`penalty=\"l2\"` regularization (which happens to be the default in\n",
- "scikit-learn) to find by yourself the effect of the parameter `C`.\n",
- "\n",
- "We will start by loading the dataset."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "<div class=\"admonition note alert alert-info\">\n",
- "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
- "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
- "Appendix - Datasets description section at the end of this MOOC.</p>\n",
- "</div>"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pandas as pd\n",
- "\n",
- "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n",
- "# only keep the Adelie and Chinstrap classes\n",
- "penguins = (\n",
- " penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n",
- ")\n",
- "\n",
- "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n",
- "target_column = \"Species\""
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.model_selection import train_test_split\n",
- "\n",
- "penguins_train, penguins_test = train_test_split(penguins, random_state=0)\n",
- "\n",
- "data_train = penguins_train[culmen_columns]\n",
- "data_test = penguins_test[culmen_columns]\n",
- "\n",
- "target_train = penguins_train[target_column]\n",
- "target_test = penguins_test[target_column]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "First, let's create our predictive model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.pipeline import make_pipeline\n",
- "from sklearn.preprocessing import StandardScaler\n",
- "from sklearn.linear_model import LogisticRegression\n",
- "\n",
- "logistic_regression = make_pipeline(\n",
- " StandardScaler(), LogisticRegression(penalty=\"l2\")\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Given the following candidates for the `C` parameter, find out the impact of\n",
- "`C` on the classifier decision boundary. You can use\n",
- "`sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the\n",
- "decision function boundary."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "Cs = [0.01, 0.1, 1, 10]\n",
- "\n",
- "# solution\n",
- "import matplotlib.pyplot as plt\n",
- "import seaborn as sns\n",
- "from sklearn.inspection import DecisionBoundaryDisplay\n",
- "\n",
- "for C in Cs:\n",
- " logistic_regression.set_params(logisticregression__C=C)\n",
- " logistic_regression.fit(data_train, target_train)\n",
- " accuracy = logistic_regression.score(data_test, target_test)\n",
- "\n",
- " DecisionBoundaryDisplay.from_estimator(\n",
- " logistic_regression,\n",
- " data_test,\n",
- " response_method=\"predict\",\n",
- " cmap=\"RdBu_r\",\n",
- " alpha=0.5,\n",
- " )\n",
- " sns.scatterplot(\n",
- " data=penguins_test,\n",
- " x=culmen_columns[0],\n",
- " y=culmen_columns[1],\n",
- " hue=target_column,\n",
- " palette=[\"tab:red\", \"tab:blue\"],\n",
- " )\n",
- " plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
- " plt.title(f\"C: {C} \\n Accuracy on the test set: {accuracy:.2f}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Look at the impact of the `C` hyperparameter on the magnitude of the weights."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "weights_ridge = []\n",
- "for C in Cs:\n",
- " logistic_regression.set_params(logisticregression__C=C)\n",
- " logistic_regression.fit(data_train, target_train)\n",
- " coefs = logistic_regression[-1].coef_[0]\n",
- " weights_ridge.append(pd.Series(coefs, index=culmen_columns))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "outputs": [],
- "source": [
- "weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f\"C: {C}\" for C in Cs])\n",
- "weights_ridge.plot.barh()\n",
- "_ = plt.title(\"LogisticRegression weights depending of C\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "source": [
- "We see that a small `C` will shrink the weights values toward zero. It means\n",
- "that a small `C` provides a more regularized model. Thus, `C` is the inverse\n",
- "of the `alpha` coefficient in the `Ridge` model.\n",
- "\n",
- "Besides, with a strong penalty (i.e. small `C` value), the weight of the\n",
- "feature \"Culmen Depth (mm)\" is almost zero. It explains why the decision\n",
- "separation in the plot is almost perpendicular to the \"Culmen Length (mm)\"\n",
- "feature."
- ]
- }
- ],
- "metadata": {
- "jupytext": {
- "main_language": "python"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
\ No newline at end of file
diff --git a/python_scripts/linear_models_ex_02.py b/python_scripts/linear_models_ex_02.py
index 640c44046..f58a1f0fe 100644
--- a/python_scripts/linear_models_ex_02.py
+++ b/python_scripts/linear_models_ex_02.py
@@ -14,100 +14,80 @@
# %% [markdown]
# # 📝 Exercise M4.02
#
-# The goal of this exercise is to build an intuition on what will be the
-# parameters' values of a linear model when the link between the data and the
-# target is non-linear.
+# In the previous notebook, we showed that we can add new features based on the
+# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`.
+# In that case we only used a single feature in `data`.
#
-# First, we will generate such non-linear data.
+# The aim of this notebook is to train a linear regression algorithm on a
+# dataset with more than a single feature. In such a "multi-dimensional" feature
+# space we can derive new features of the form `x1 * x2`, `x2 * x3`,
+# etc. Products of features are usually called "non-linear or
+# multiplicative interactions" between features.
#
-# ```{tip}
-# `np.random.RandomState` allows to create a random number generator which can
-# be later used to get deterministic results.
-# ```
-
-# %%
-import numpy as np
-
-# Set the seed for reproduction
-rng = np.random.RandomState(0)
-
-# Generate data
-n_sample = 100
-data_max, data_min = 1.4, -1.4
-len_data = data_max - data_min
-data = rng.rand(n_sample) * len_data - len_data / 2
-noise = rng.randn(n_sample) * 0.3
-target = data**3 - 0.5 * data**2 + noise
+# Feature engineering can be an important step of a model pipeline as long as
+# the new features are expected to be predictive. For instance, think of a
+# classification model to decide if a patient has risk of developing a heart
+# disease. This would depend on the patient's Body Mass Index which is defined
+# as `weight / height ** 2`.
+#
+# We load the penguins dataset. We first use a set of 3 numerical
+# features to predict the target, i.e. the body mass of the penguin.
# %% [markdown]
# ```{note}
-# To ease the plotting, we will create a Pandas dataframe containing the data
-# and target
+# If you want a deeper overview regarding this dataset, you can refer to the
+# Appendix - Datasets description section at the end of this MOOC.
# ```
# %%
import pandas as pd
-full_data = pd.DataFrame({"data": data, "target": target})
+penguins = pd.read_csv("../datasets/penguins.csv")
-# %%
-import seaborn as sns
+columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
+target_name = "Body Mass (g)"
-_ = sns.scatterplot(
- data=full_data, x="data", y="target", color="black", alpha=0.5
-)
+# Remove lines with missing values for the columns of interest
+penguins_non_missing = penguins[columns + [target_name]].dropna()
-# %% [markdown]
-# We observe that the link between the data `data` and vector `target` is
-# non-linear. For instance, `data` could represent the years of experience
-# (normalized) and `target` the salary (normalized). Therefore, the problem here
-# would be to infer the salary given the years of experience.
-#
-# Using the function `f` defined below, find both the `weight` and the
-# `intercept` that you think will lead to a good linear model. Plot both the
-# data and the predictions of this model.
-
-
-# %%
-def f(data, weight=0, intercept=0):
- target_predict = weight * data + intercept
- return target_predict
+data = penguins_non_missing[columns]
+target = penguins_non_missing[target_name]
+data.head()
+# %% [markdown]
+# Now it is your turn to train a linear regression model on this dataset. First,
+# create a linear regression model.
# %%
# Write your code here.
# %% [markdown]
-# Compute the mean squared error for this model
+# Execute a cross-validation with 10 folds and use the mean absolute error (MAE)
+# as metric.
# %%
# Write your code here.
# %% [markdown]
-# Train a linear regression model on this dataset.
-#
-# ```{warning}
-# In scikit-learn, by convention `data` (also called `X` in the scikit-learn
-# documentation) should be a 2D matrix of shape `(n_samples, n_features)`.
-# If `data` is a 1D vector, you need to reshape it into a matrix with a
-# single column if the vector represents a feature or a single row if the
-# vector represents a sample.
-# ```
+# Compute the mean and std of the MAE in grams (g).
# %%
-from sklearn.linear_model import LinearRegression
-
# Write your code here.
# %% [markdown]
-# Compute predictions from the linear regression model and plot both the data
-# and the predictions.
+# Now create a pipeline using `make_pipeline` consisting of a
+# `PolynomialFeatures` and a linear regression. Set `degree=2` and
+# `interaction_only=True` to the feature engineering step. Remember not to
+# include the bias to avoid redundancies with the linear regression's intercept.
+#
+# Use the same strategy as before to cross-validate such a pipeline.
# %%
# Write your code here.
# %% [markdown]
-# Compute the mean squared error
+# Compute the mean and std of the MAE in grams (g) and compare with the results
+# without feature engineering.
# %%
# Write your code here.
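
For reference, the reworked `linear_models_ex_02.py` above can be solved along the following lines. This is only a sketch based on the prompts in the hunk (the dataset path, feature names, and `PolynomialFeatures` settings are the ones given in the exercise text); it is not the project's official solution script, which is not shown in this diff.

```python
# Sketch of a possible solution for the updated exercise.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

penguins = pd.read_csv("../datasets/penguins.csv")
columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Body Mass (g)"
penguins_non_missing = penguins[columns + [target_name]].dropna()
data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]

# Baseline: plain linear regression, 10-fold cross-validation scored with MAE.
cv_results = cross_validate(
    LinearRegression(), data, target, cv=10, scoring="neg_mean_absolute_error"
)
print(
    f"MAE: {-cv_results['test_score'].mean():.0f} g "
    f"± {cv_results['test_score'].std():.0f} g"
)

# Same evaluation with multiplicative interaction features; include_bias=False
# avoids adding a constant column that duplicates the regression intercept.
poly_regression = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)
cv_results_poly = cross_validate(
    poly_regression, data, target, cv=10, scoring="neg_mean_absolute_error"
)
print(
    f"MAE with interactions: {-cv_results_poly['test_score'].mean():.0f} g "
    f"± {cv_results_poly['test_score'].std():.0f} g"
)
```

Comparing the two MAE printouts is exactly what the last prompt asks for: it shows whether the interaction features help on this dataset.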
diff --git a/python_scripts/linear_models_ex_03.py b/python_scripts/linear_models_ex_03.py
index 3ab6949a3..9c311e817 100644
--- a/python_scripts/linear_models_ex_03.py
+++ b/python_scripts/linear_models_ex_03.py
@@ -14,24 +14,14 @@
# %% [markdown]
# # 📝 Exercise M4.03
#
-# In the previous notebook, we showed that we can add new features based on the
-# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`.
-# In that case we only used a single feature in `data`.
+# The parameter `penalty` can control the **type** of regularization to use,
+# whereas the regularization **strength** is set using the parameter `C`.
+# Setting `penalty="none"` is equivalent to an infinitely large value of `C`. In
+# this exercise, we ask you to train a logistic regression classifier using the
+# `penalty="l2"` regularization (which happens to be the default in
+# scikit-learn) to find by yourself the effect of the parameter `C`.
#
-# The aim of this notebook is to train a linear regression algorithm on a
-# dataset with more than a single feature. In such a "multi-dimensional" feature
-# space we can derive new features of the form `x1 * x2`, `x2 * x3`,
-# etc. Products of features are usually called "non-linear or
-# multiplicative interactions" between features.
-#
-# Feature engineering can be an important step of a model pipeline as long as
-# the new features are expected to be predictive. For instance, think of a
-# classification model to decide if a patient has risk of developing a heart
-# disease. This would depend on the patient's Body Mass Index which is defined
-# as `weight / height ** 2`.
-#
-# We load the dataset penguins dataset. We first use a set of 3 numerical
-# features to predict the target, i.e. the body mass of the penguin.
+# We start by loading the dataset.
# %% [markdown]
# ```{note}
@@ -42,52 +32,51 @@
# %%
import pandas as pd
-penguins = pd.read_csv("../datasets/penguins.csv")
+penguins = pd.read_csv("../datasets/penguins_classification.csv")
+# only keep the Adelie and Chinstrap classes
+penguins = (
+ penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
+)
-columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
-target_name = "Body Mass (g)"
+culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
+target_column = "Species"
-# Remove lines with missing values for the columns of interest
-penguins_non_missing = penguins[columns + [target_name]].dropna()
+# %%
+from sklearn.model_selection import train_test_split
-data = penguins_non_missing[columns]
-target = penguins_non_missing[target_name]
-data.head()
+penguins_train, penguins_test = train_test_split(penguins, random_state=0)
-# %% [markdown]
-# Now it is your turn to train a linear regression model on this dataset. First,
-# create a linear regression model.
+data_train = penguins_train[culmen_columns]
+data_test = penguins_test[culmen_columns]
-# %%
-# Write your code here.
+target_train = penguins_train[target_column]
+target_test = penguins_test[target_column]
# %% [markdown]
-# Execute a cross-validation with 10 folds and use the mean absolute error (MAE)
-# as metric.
+# First, let's create our predictive model.
# %%
-# Write your code here.
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import StandardScaler
+from sklearn.linear_model import LogisticRegression
-# %% [markdown]
-# Compute the mean and std of the MAE in grams (g).
-
-# %%
-# Write your code here.
+logistic_regression = make_pipeline(
+ StandardScaler(), LogisticRegression(penalty="l2")
+)
# %% [markdown]
-# Now create a pipeline using `make_pipeline` consisting of a
-# `PolynomialFeatures` and a linear regression. Set `degree=2` and
-# `interaction_only=True` to the feature engineering step. Remember not to
-# include the bias to avoid redundancies with the linear's regression intercept.
-#
-# Use the same strategy as before to cross-validate such a pipeline.
+# Given the following candidates for the `C` parameter, find out the impact of
+# `C` on the classifier decision boundary. You can use
+# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
+# decision boundary.
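+
+# %% [markdown]
+# Hint (editorial addition): the `C` parameter of the `LogisticRegression` step
+# nested in the pipeline above can be updated in place with `set_params`, as
+# sketched below.
+
+# %%
+# Editorial sketch: update the regularization strength of the nested step.
+logistic_regression.set_params(logisticregression__C=0.1)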
# %%
+Cs = [0.01, 0.1, 1, 10]
+
# Write your code here.
# %% [markdown]
-# Compute the mean and std of the MAE in grams (g) and compare with the results
-# without feature engineering.
+# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
# %%
# Write your code here.
diff --git a/python_scripts/linear_models_ex_04.py b/python_scripts/linear_models_ex_04.py
deleted file mode 100644
index 18191bccf..000000000
--- a/python_scripts/linear_models_ex_04.py
+++ /dev/null
@@ -1,92 +0,0 @@
-# ---
-# jupyter:
-# jupytext:
-# text_representation:
-# extension: .py
-# format_name: percent
-# format_version: '1.3'
-# jupytext_version: 1.14.5
-# kernelspec:
-# display_name: Python 3
-# name: python3
-# ---
-
-# %% [markdown]
-# # 📝 Exercise M4.04
-#
-# In the previous notebook, we saw the effect of applying some regularization on
-# the coefficient of a linear model.
-#
-# In this exercise, we will study the advantage of using some regularization
-# when dealing with correlated features.
-#
-# We will first create a regression dataset. This dataset will contain 2,000
-# samples and 5 features from which only 2 features will be informative.
-
-# %%
-from sklearn.datasets import make_regression
-
-data, target, coef = make_regression(
- n_samples=2_000,
- n_features=5,
- n_informative=2,
- shuffle=False,
- coef=True,
- random_state=0,
- noise=30,
-)
-
-# %% [markdown]
-# When creating the dataset, `make_regression` returns the true coefficient used
-# to generate the dataset. Let's plot this information.
-
-# %%
-import pandas as pd
-
-feature_names = [
- "Relevant feature #0",
- "Relevant feature #1",
- "Noisy feature #0",
- "Noisy feature #1",
- "Noisy feature #2",
-]
-coef = pd.Series(coef, index=feature_names)
-coef.plot.barh()
-coef
-
-# %% [markdown]
-# Create a `LinearRegression` regressor and fit on the entire dataset and check
-# the value of the coefficients. Are the coefficients of the linear regressor
-# close to the coefficients used to generate the dataset?
-
-# %%
-# Write your code here.
-
-# %% [markdown]
-# Now, create a new dataset that will be the same as `data` with 4 additional
-# columns that will repeat twice features 0 and 1. This procedure will create
-# perfectly correlated features.
-
-# %%
-# Write your code here.
-
-# %% [markdown]
-# Fit again the linear regressor on this new dataset and check the coefficients.
-# What do you observe?
-
-# %%
-# Write your code here.
-
-# %% [markdown]
-# Create a ridge regressor and fit on the same dataset. Check the coefficients.
-# What do you observe?
-
-# %%
-# Write your code here.
-
-# %% [markdown]
-# Can you find the relationship between the ridge coefficients and the original
-# coefficients?
-
-# %%
-# Write your code here.
diff --git a/python_scripts/linear_models_ex_05.py b/python_scripts/linear_models_ex_05.py
deleted file mode 100644
index 1c36b83c2..000000000
--- a/python_scripts/linear_models_ex_05.py
+++ /dev/null
@@ -1,83 +0,0 @@
-# ---
-# jupyter:
-# jupytext:
-# text_representation:
-# extension: .py
-# format_name: percent
-# format_version: '1.3'
-# jupytext_version: 1.14.5
-# kernelspec:
-# display_name: Python 3
-# name: python3
-# ---
-
-# %% [markdown]
-# # 📝 Exercise M4.05
-#
-# In the previous notebook we set `penalty="none"` to disable regularization
-# entirely. This parameter can also control the **type** of regularization to
-# use, whereas the regularization **strength** is set using the parameter `C`.
-# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In
-# this exercise, we ask you to train a logistic regression classifier using the
-# `penalty="l2"` regularization (which happens to be the default in
-# scikit-learn) to find by yourself the effect of the parameter `C`.
-#
-# We will start by loading the dataset.
-
-# %% [markdown]
-# ```{note}
-# If you want a deeper overview regarding this dataset, you can refer to the
-# Appendix - Datasets description section at the end of this MOOC.
-# ```
-
-# %%
-import pandas as pd
-
-penguins = pd.read_csv("../datasets/penguins_classification.csv")
-# only keep the Adelie and Chinstrap classes
-penguins = (
- penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
-)
-
-culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
-target_column = "Species"
-
-# %%
-from sklearn.model_selection import train_test_split
-
-penguins_train, penguins_test = train_test_split(penguins, random_state=0)
-
-data_train = penguins_train[culmen_columns]
-data_test = penguins_test[culmen_columns]
-
-target_train = penguins_train[target_column]
-target_test = penguins_test[target_column]
-
-# %% [markdown]
-# First, let's create our predictive model.
-
-# %%
-from sklearn.pipeline import make_pipeline
-from sklearn.preprocessing import StandardScaler
-from sklearn.linear_model import LogisticRegression
-
-logistic_regression = make_pipeline(
- StandardScaler(), LogisticRegression(penalty="l2")
-)
-
-# %% [markdown]
-# Given the following candidates for the `C` parameter, find out the impact of
-# `C` on the classifier decision boundary. You can use
-# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
-# decision function boundary.
-
-# %%
-Cs = [0.01, 0.1, 1, 10]
-
-# Write your code here.
-
-# %% [markdown]
-# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
-
-# %%
-# Write your code here.
diff --git a/python_scripts/linear_models_sol_02.py b/python_scripts/linear_models_sol_02.py
index d62a4b983..3abc476da 100644
--- a/python_scripts/linear_models_sol_02.py
+++ b/python_scripts/linear_models_sol_02.py
@@ -8,123 +8,127 @@
# %% [markdown]
# # 📃 Solution for Exercise M4.02
#
-# The goal of this exercise is to build an intuition on what will be the
-# parameters' values of a linear model when the link between the data and the
-# target is non-linear.
+# In the previous notebook, we showed that we can add new features based on the
+# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`.
+# In that case we only used a single feature in `data`.
#
-# First, we will generate such non-linear data.
+# The aim of this notebook is to train a linear regression algorithm on a
+# dataset with more than a single feature. In such a "multi-dimensional" feature
+# space we can derive new features of the form `x1 * x2`, `x2 * x3`,
+# etc. Products of features are usually called "non-linear or
+# multiplicative interactions" between features.
#
-# ```{tip}
-# `np.random.RandomState` allows to create a random number generator which can
-# be later used to get deterministic results.
-# ```
-
-# %%
-import numpy as np
-
-# Set the seed for reproduction
-rng = np.random.RandomState(0)
-
-# Generate data
-n_sample = 100
-data_max, data_min = 1.4, -1.4
-len_data = data_max - data_min
-data = rng.rand(n_sample) * len_data - len_data / 2
-noise = rng.randn(n_sample) * 0.3
-target = data**3 - 0.5 * data**2 + noise
+# Feature engineering can be an important step of a model pipeline as long as
+# the new features are expected to be predictive. For instance, think of a
+# classification model that predicts whether a patient is at risk of developing
+# a heart disease. Such a risk would depend on the patient's Body Mass Index,
+# which is defined as `weight / height ** 2`.
+#
+# We load the penguins dataset. We first use a set of 3 numerical features to
+# predict the target, i.e. the body mass of the penguin.
# %% [markdown]
# ```{note}
-# To ease the plotting, we will create a Pandas dataframe containing the data
-# and target
+# If you want a deeper overview regarding this dataset, you can refer to the
+# Appendix - Datasets description section at the end of this MOOC.
# ```
# %%
import pandas as pd
-full_data = pd.DataFrame({"data": data, "target": target})
-
-# %%
-import seaborn as sns
-
-_ = sns.scatterplot(
- data=full_data, x="data", y="target", color="black", alpha=0.5
-)
+penguins = pd.read_csv("../datasets/penguins.csv")
-# %% [markdown]
-# We observe that the link between the data `data` and vector `target` is
-# non-linear. For instance, `data` could represent the years of experience
-# (normalized) and `target` the salary (normalized). Therefore, the problem here
-# would be to infer the salary given the years of experience.
-#
-# Using the function `f` defined below, find both the `weight` and the
-# `intercept` that you think will lead to a good linear model. Plot both the
-# data and the predictions of this model.
+columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
+target_name = "Body Mass (g)"
+# Remove rows with missing values for the columns of interest
+penguins_non_missing = penguins[columns + [target_name]].dropna()
-# %%
-def f(data, weight=0, intercept=0):
- target_predict = weight * data + intercept
- return target_predict
+data = penguins_non_missing[columns]
+target = penguins_non_missing[target_name]
+data.head()
+# %% [markdown]
+# Now it is your turn to train a linear regression model on this dataset. First,
+# create a linear regression model.
# %%
# solution
-predictions = f(data, weight=1.2, intercept=-0.2)
+from sklearn.linear_model import LinearRegression
-# %% tags=["solution"]
-ax = sns.scatterplot(
- data=full_data, x="data", y="target", color="black", alpha=0.5
-)
-_ = ax.plot(data, predictions)
+linear_regression = LinearRegression()
# %% [markdown]
-# Compute the mean squared error for this model
+# Execute a cross-validation with 10 folds and use the mean absolute error (MAE)
+# as the metric.
# %%
# solution
-from sklearn.metrics import mean_squared_error
-
-error = mean_squared_error(target, f(data, weight=1.2, intercept=-0.2))
-print(f"The MSE is {error}")
+from sklearn.model_selection import cross_validate
+
+cv_results = cross_validate(
+ linear_regression,
+ data,
+ target,
+ cv=10,
+ scoring="neg_mean_absolute_error",
+ n_jobs=2,
+)
# %% [markdown]
-# Train a linear regression model on this dataset.
-#
-# ```{warning}
-# In scikit-learn, by convention `data` (also called `X` in the scikit-learn
-# documentation) should be a 2D matrix of shape `(n_samples, n_features)`.
-# If `data` is a 1D vector, you need to reshape it into a matrix with a
-# single column if the vector represents a feature or a single row if the
-# vector represents a sample.
-# ```
+# Compute the mean and std of the MAE in grams (g).
# %%
-from sklearn.linear_model import LinearRegression
-
# solution
-linear_regression = LinearRegression()
-data_2d = data.reshape(-1, 1)
-linear_regression.fit(data_2d, target)
+print(
+ "Mean absolute error on testing set with original features: "
+ f"{-cv_results['test_score'].mean():.3f} ± "
+ f"{cv_results['test_score'].std():.3f} g"
+)
# %% [markdown]
-# Compute predictions from the linear regression model and plot both the data
-# and the predictions.
+# Now create a pipeline using `make_pipeline` consisting of a
+# `PolynomialFeatures` and a linear regression. Set `degree=2` and
+# `interaction_only=True` for the feature engineering step. Remember not to
+# include the bias to avoid redundancies with the linear regression's intercept.
+#
+# Use the same strategy as before to cross-validate such a pipeline.
# %%
# solution
-predictions = linear_regression.predict(data_2d)
+from sklearn.preprocessing import PolynomialFeatures
+from sklearn.pipeline import make_pipeline
-# %% tags=["solution"]
-ax = sns.scatterplot(
- data=full_data, x="data", y="target", color="black", alpha=0.5
+poly_features = PolynomialFeatures(
+ degree=2, include_bias=False, interaction_only=True
+)
+linear_regression_interactions = make_pipeline(
+ poly_features, linear_regression
+)
+
+cv_results = cross_validate(
+ linear_regression_interactions,
+ data,
+ target,
+ cv=10,
+ scoring="neg_mean_absolute_error",
+ n_jobs=2,
)
-_ = ax.plot(data, predictions)
# %% [markdown]
-# Compute the mean squared error
+# Compute the mean and std of the MAE in grams (g) and compare with the results
+# without feature engineering.
# %%
# solution
-error = mean_squared_error(target, predictions)
-print(f"The MSE is {error}")
+print(
+ "Mean absolute error on testing set with interactions: "
+ f"{-cv_results['test_score'].mean():.3f} ± "
+ f"{cv_results['test_score'].std():.3f} g"
+)
+
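+# %% [markdown] tags=["solution"]
+# As a complementary check (editorial addition), we can fit the pipeline once on
+# the full dataset and list the names of the engineered features, i.e. the
+# original columns and their pairwise products.
+
+# %% tags=["solution"]
+# Editorial sketch: fit once on the full data to inspect the expanded features.
+linear_regression_interactions.fit(data, target)
+linear_regression_interactions[0].get_feature_names_out()
+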
+# %% [markdown] tags=["solution"]
+# We observe that the mean absolute error is lower and varies less across folds
+# with the enriched features. In this case the "interactions" are indeed
+# predictive. In the following notebook we will see what happens when the
+# enriched features are non-predictive and how to deal with this case.
diff --git a/python_scripts/linear_models_sol_03.py b/python_scripts/linear_models_sol_03.py
index 0cacfcf0d..d789c8522 100644
--- a/python_scripts/linear_models_sol_03.py
+++ b/python_scripts/linear_models_sol_03.py
@@ -8,24 +8,14 @@
# %% [markdown]
# # 📃 Solution for Exercise M4.03
#
-# In the previous notebook, we showed that we can add new features based on the
-# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`.
-# In that case we only used a single feature in `data`.
+# In scikit-learn's `LogisticRegression`, the parameter `penalty` controls the
+# **type** of regularization to use, whereas the regularization **strength** is
+# set using the parameter `C`. Setting `penalty=None` is equivalent to an
+# infinitely large value of `C`. In this exercise, we ask you to train a
+# logistic regression classifier using the `penalty="l2"` regularization (which
+# happens to be the default in scikit-learn) to find out by yourself the effect
+# of the parameter `C`.
#
-# The aim of this notebook is to train a linear regression algorithm on a
-# dataset with more than a single feature. In such a "multi-dimensional" feature
-# space we can derive new features of the form `x1 * x2`, `x2 * x3`,
-# etc. Products of features are usually called "non-linear or
-# multiplicative interactions" between features.
-#
-# Feature engineering can be an important step of a model pipeline as long as
-# the new features are expected to be predictive. For instance, think of a
-# classification model to decide if a patient has risk of developing a heart
-# disease. This would depend on the patient's Body Mass Index which is defined
-# as `weight / height ** 2`.
-#
-# We load the dataset penguins dataset. We first use a set of 3 numerical
-# features to predict the target, i.e. the body mass of the penguin.
+# We start by loading the dataset.
# %% [markdown]
# ```{note}
@@ -36,99 +26,97 @@
# %%
import pandas as pd
-penguins = pd.read_csv("../datasets/penguins.csv")
-
-columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
-target_name = "Body Mass (g)"
-
-# Remove lines with missing values for the columns of interest
-penguins_non_missing = penguins[columns + [target_name]].dropna()
-
-data = penguins_non_missing[columns]
-target = penguins_non_missing[target_name]
-data.head()
+penguins = pd.read_csv("../datasets/penguins_classification.csv")
+# only keep the Adelie and Chinstrap classes
+penguins = (
+ penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
+)
-# %% [markdown]
-# Now it is your turn to train a linear regression model on this dataset. First,
-# create a linear regression model.
+culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
+target_column = "Species"
# %%
-# solution
-from sklearn.linear_model import LinearRegression
+from sklearn.model_selection import train_test_split
-linear_regression = LinearRegression()
+penguins_train, penguins_test = train_test_split(penguins, random_state=0)
-# %% [markdown]
-# Execute a cross-validation with 10 folds and use the mean absolute error (MAE)
-# as metric.
+data_train = penguins_train[culmen_columns]
+data_test = penguins_test[culmen_columns]
-# %%
-# solution
-from sklearn.model_selection import cross_validate
-
-cv_results = cross_validate(
- linear_regression,
- data,
- target,
- cv=10,
- scoring="neg_mean_absolute_error",
- n_jobs=2,
-)
+target_train = penguins_train[target_column]
+target_test = penguins_test[target_column]
# %% [markdown]
-# Compute the mean and std of the MAE in grams (g).
+# First, let's create our predictive model.
# %%
-# solution
-print(
- "Mean absolute error on testing set with original features: "
- f"{-cv_results['test_score'].mean():.3f} ± "
- f"{cv_results['test_score'].std():.3f} g"
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import StandardScaler
+from sklearn.linear_model import LogisticRegression
+
+logistic_regression = make_pipeline(
+ StandardScaler(), LogisticRegression(penalty="l2")
)
# %% [markdown]
-# Now create a pipeline using `make_pipeline` consisting of a
-# `PolynomialFeatures` and a linear regression. Set `degree=2` and
-# `interaction_only=True` to the feature engineering step. Remember not to
-# include the bias to avoid redundancies with the linear's regression intercept.
-#
-# Use the same strategy as before to cross-validate such a pipeline.
+# Given the following candidates for the `C` parameter, find out the impact of
+# `C` on the classifier decision boundary. You can use
+# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
+# decision boundary.
# %%
-# solution
-from sklearn.preprocessing import PolynomialFeatures
-from sklearn.pipeline import make_pipeline
-
-poly_features = PolynomialFeatures(
- degree=2, include_bias=False, interaction_only=True
-)
-linear_regression_interactions = make_pipeline(
- poly_features, linear_regression
-)
+Cs = [0.01, 0.1, 1, 10]
-cv_results = cross_validate(
- linear_regression_interactions,
- data,
- target,
- cv=10,
- scoring="neg_mean_absolute_error",
- n_jobs=2,
-)
+# solution
+import matplotlib.pyplot as plt
+import seaborn as sns
+from sklearn.inspection import DecisionBoundaryDisplay
+
+for C in Cs:
+ logistic_regression.set_params(logisticregression__C=C)
+ logistic_regression.fit(data_train, target_train)
+ accuracy = logistic_regression.score(data_test, target_test)
+
+ DecisionBoundaryDisplay.from_estimator(
+ logistic_regression,
+ data_test,
+ response_method="predict",
+ cmap="RdBu_r",
+ alpha=0.5,
+ )
+ sns.scatterplot(
+ data=penguins_test,
+ x=culmen_columns[0],
+ y=culmen_columns[1],
+ hue=target_column,
+ palette=["tab:red", "tab:blue"],
+ )
+ plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
+ plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")
# %% [markdown]
-# Compute the mean and std of the MAE in grams (g) and compare with the results
-# without feature engineering.
+# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
# %%
# solution
-print(
- "Mean absolute error on testing set with interactions: "
- f"{-cv_results['test_score'].mean():.3f} ± "
- f"{cv_results['test_score'].std():.3f} g"
-)
+weights_ridge = []
+for C in Cs:
+ logistic_regression.set_params(logisticregression__C=C)
+ logistic_regression.fit(data_train, target_train)
+ coefs = logistic_regression[-1].coef_[0]
+ weights_ridge.append(pd.Series(coefs, index=culmen_columns))
+
+# %% tags=["solution"]
+weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f"C: {C}" for C in Cs])
+weights_ridge.plot.barh()
+_ = plt.title("LogisticRegression weights depending on C")
# %% [markdown] tags=["solution"]
-# We observe that the mean absolute error is lower and less spread with the
-# enriched features. In this case the "interactions" are indeed predictive. In
-# the following notebook we will see what happens when the enriched features are
-# non-predictive and how to deal with this case.
+# We see that a small `C` shrinks the weight values toward zero. It means that
+# a small `C` provides a more regularized model. Thus, `C` plays the role of
+# the inverse of the `alpha` parameter in the `Ridge` model.
+#
+# Besides, with a strong penalty (i.e. a small `C` value), the weight of the
+# feature "Culmen Depth (mm)" is almost zero. This explains why the decision
+# boundary in the plot is almost perpendicular to the "Culmen Length (mm)"
+# feature.
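+#
+# To make the analogy with `Ridge` more concrete (editorial note; see the
+# scikit-learn user guide for the exact formulation), the two objective
+# functions can be sketched, ignoring the intercept, as
+#
+# $$\min_{w} \; \frac{1}{2} \|w\|_2^2 + C \sum_{i} \log\left(1 + \exp\left(-y_i \, x_i^T w\right)\right)$$
+#
+# for the l2-penalized logistic regression, and
+#
+# $$\min_{w} \; \|y - Xw\|_2^2 + \alpha \|w\|_2^2$$
+#
+# for `Ridge`. The data-fit term is multiplied by `C` in the first case while
+# the penalty term is multiplied by `alpha` in the second, hence increasing `C`
+# weakens the regularization just like decreasing `alpha` does.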
diff --git a/python_scripts/linear_models_sol_04.py b/python_scripts/linear_models_sol_04.py
deleted file mode 100644
index a759c3d24..000000000
--- a/python_scripts/linear_models_sol_04.py
+++ /dev/null
@@ -1,269 +0,0 @@
-# ---
-# jupyter:
-# kernelspec:
-# display_name: Python 3
-# name: python3
-# ---
-
-# %% [markdown]
-# # 📃 Solution for Exercise M4.04
-#
-# In the previous notebook, we saw the effect of applying some regularization on
-# the coefficient of a linear model.
-#
-# In this exercise, we will study the advantage of using some regularization
-# when dealing with correlated features.
-#
-# We will first create a regression dataset. This dataset will contain 2,000
-# samples and 5 features from which only 2 features will be informative.
-
-# %%
-from sklearn.datasets import make_regression
-
-data, target, coef = make_regression(
- n_samples=2_000,
- n_features=5,
- n_informative=2,
- shuffle=False,
- coef=True,
- random_state=0,
- noise=30,
-)
-
-# %% [markdown]
-# When creating the dataset, `make_regression` returns the true coefficient used
-# to generate the dataset. Let's plot this information.
-
-# %%
-import pandas as pd
-
-feature_names = [
- "Relevant feature #0",
- "Relevant feature #1",
- "Noisy feature #0",
- "Noisy feature #1",
- "Noisy feature #2",
-]
-coef = pd.Series(coef, index=feature_names)
-coef.plot.barh()
-coef
-
-# %% [markdown]
-# Create a `LinearRegression` regressor and fit on the entire dataset and check
-# the value of the coefficients. Are the coefficients of the linear regressor
-# close to the coefficients used to generate the dataset?
-
-# %%
-# solution
-from sklearn.linear_model import LinearRegression
-
-linear_regression = LinearRegression()
-linear_regression.fit(data, target)
-linear_regression.coef_
-
-# %% tags=["solution"]
-feature_names = [
- "Relevant feature #0",
- "Relevant feature #1",
- "Noisy feature #0",
- "Noisy feature #1",
- "Noisy feature #2",
-]
-coef = pd.Series(linear_regression.coef_, index=feature_names)
-_ = coef.plot.barh()
-
-# %% [markdown] tags=["solution"]
-# We see that the coefficients are close to the coefficients used to generate
-# the dataset. The dispersion is indeed cause by the noise injected during the
-# dataset generation.
-
-# %% [markdown]
-# Now, create a new dataset that will be the same as `data` with 4 additional
-# columns that will repeat twice features 0 and 1. This procedure will create
-# perfectly correlated features.
-
-# %%
-# solution
-import numpy as np
-
-data = np.concatenate([data, data[:, [0, 1]], data[:, [0, 1]]], axis=1)
-
-# %% [markdown]
-# Fit again the linear regressor on this new dataset and check the coefficients.
-# What do you observe?
-
-# %%
-# solution
-linear_regression = LinearRegression()
-linear_regression.fit(data, target)
-linear_regression.coef_
-
-# %% tags=["solution"]
-feature_names = [
- "Relevant feature #0",
- "Relevant feature #1",
- "Noisy feature #0",
- "Noisy feature #1",
- "Noisy feature #2",
- "First repetition of feature #0",
- "First repetition of feature #1",
- "Second repetition of feature #0",
- "Second repetition of feature #1",
-]
-coef = pd.Series(linear_regression.coef_, index=feature_names)
-_ = coef.plot.barh()
-
-# %% [markdown] tags=["solution"]
-# We see that the coefficient values are far from what one could expect. By
-# repeating the informative features, one would have expected these coefficients
-# to be similarly informative.
-#
-# Instead, we see that some coefficients have a huge norm ~1e14. It indeed means
-# that we try to solve an mathematical ill-posed problem. Indeed, finding
-# coefficients in a linear regression involves inverting the matrix
-# `np.dot(data.T, data)` which is not possible (or lead to high numerical
-# errors).
-
-# %% [markdown]
-# Create a ridge regressor and fit on the same dataset. Check the coefficients.
-# What do you observe?
-
-# %%
-# solution
-from sklearn.linear_model import Ridge
-
-ridge = Ridge()
-ridge.fit(data, target)
-ridge.coef_
-
-# %% tags=["solution"]
-coef = pd.Series(ridge.coef_, index=feature_names)
-_ = coef.plot.barh()
-
-# %% [markdown] tags=["solution"]
-# We see that the penalty applied on the weights give a better results: the
-# values of the coefficients do not suffer from numerical issues. Indeed, the
-# matrix to be inverted internally is `np.dot(data.T, data) + alpha * I`. Adding
-# this penalty `alpha` allow the inversion without numerical issue.
-
-# %% [markdown]
-# Can you find the relationship between the ridge coefficients and the original
-# coefficients?
-
-# %%
-# solution
-ridge.coef_[:5] * 3
-
-# %% [markdown] tags=["solution"]
-# Repeating three times each informative features induced to divide the ridge
-# coefficients by three.
-
-# %% [markdown] tags=["solution"]
-# ```{tip}
-# We advise to always use a penalty to shrink the magnitude of the weights
-# toward zero (also called "l2 penalty"). In scikit-learn, `LogisticRegression`
-# applies such penalty by default. However, one needs to use `Ridge` (and even
-# `RidgeCV` to tune the parameter `alpha`) instead of `LinearRegression`.
-#
-# Other kinds of regularizations exist but will not be covered in this course.
-# ```
-#
-# ## Dealing with correlation between one-hot encoded features
-#
-# In this section, we will focus on how to deal with correlated features that
-# arise naturally when one-hot encoding categorical features.
-#
-# Let's first load the Ames housing dataset and take a subset of features that
-# are only categorical features.
-
-# %% tags=["solution"]
-import pandas as pd
-from sklearn.model_selection import train_test_split
-
-ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
-ames_housing = ames_housing.drop(columns="Id")
-
-categorical_columns = ["Street", "Foundation", "CentralAir", "PavedDrive"]
-target_name = "SalePrice"
-X, y = ames_housing[categorical_columns], ames_housing[target_name]
-
-X_train, X_test, y_train, y_test = train_test_split(
- X, y, test_size=0.2, random_state=0
-)
-
-# %% [markdown] tags=["solution"]
-#
-# We previously presented that a `OneHotEncoder` creates as many columns as
-# categories. Therefore, there is always one column (i.e. one encoded category)
-# that can be inferred from the others. Thus, `OneHotEncoder` creates collinear
-# features.
-#
-# We illustrate this behaviour by considering the "CentralAir" feature that
-# contains only two categories:
-
-# %% tags=["solution"]
-X_train["CentralAir"]
-
-# %% tags=["solution"]
-from sklearn.preprocessing import OneHotEncoder
-
-single_feature = ["CentralAir"]
-encoder = OneHotEncoder(sparse_output=False, dtype=np.int32)
-X_trans = encoder.fit_transform(X_train[single_feature])
-X_trans = pd.DataFrame(
- X_trans,
- columns=encoder.get_feature_names_out(input_features=single_feature),
-)
-X_trans
-
-# %% [markdown] tags=["solution"]
-#
-# Here, we see that the encoded category "CentralAir_N" is the opposite of the
-# encoded category "CentralAir_Y". Therefore, we observe that using a
-# `OneHotEncoder` creates two features having the problematic pattern observed
-# earlier in this exercise. Training a linear regression model on such a of
-# one-hot encoded binary feature can therefore lead to numerical problems,
-# especially without regularization. Furthermore, the two one-hot features are
-# redundant as they encode exactly the same information in opposite ways.
-#
-# Using regularization helps to overcome the numerical issues that we
-# highlighted earlier in this exercise.
-#
-# Another strategy is to arbitrarily drop one of the encoded categories.
-# Scikit-learn provides such an option by setting the parameter `drop` in the
-# `OneHotEncoder`. This parameter can be set to `first` to always drop the first
-# encoded category or `binary_only` to only drop a column in the case of binary
-# categories.
-
-# %% tags=["solution"]
-encoder = OneHotEncoder(drop="first", sparse_output=False, dtype=np.int32)
-X_trans = encoder.fit_transform(X_train[single_feature])
-X_trans = pd.DataFrame(
- X_trans,
- columns=encoder.get_feature_names_out(input_features=single_feature),
-)
-X_trans
-
-# %% [markdown] tags=["solution"]
-#
-# We see that only the second column of the previous encoded data is kept.
-# Dropping one of the one-hot encoded column is a common practice, especially
-# for binary categorical features. Note however that this breaks symmetry
-# between categories and impacts the number of coefficients of the model, their
-# values, and thus their meaning, especially when applying strong
-# regularization.
-#
-# Let's finally illustrate how to use this option is a machine-learning
-# pipeline:
-
-# %% tags=["solution"]
-from sklearn.pipeline import make_pipeline
-
-model = make_pipeline(OneHotEncoder(drop="first", dtype=np.int32), Ridge())
-model.fit(X_train, y_train)
-n_categories = [X_train[col].nunique() for col in X_train.columns]
-print(f"R2 score on the testing set: {model.score(X_test, y_test):.2f}")
-print(
- f"Our model contains {model[-1].coef_.size} features while "
- f"{sum(n_categories)} categories are originally available."
-)
diff --git a/python_scripts/linear_models_sol_05.py b/python_scripts/linear_models_sol_05.py
deleted file mode 100644
index bc4a15df1..000000000
--- a/python_scripts/linear_models_sol_05.py
+++ /dev/null
@@ -1,123 +0,0 @@
-# ---
-# jupyter:
-# kernelspec:
-# display_name: Python 3
-# name: python3
-# ---
-
-# %% [markdown]
-# # 📃 Solution for Exercise M4.05
-#
-# In the previous notebook we set `penalty="none"` to disable regularization
-# entirely. This parameter can also control the **type** of regularization to
-# use, whereas the regularization **strength** is set using the parameter `C`.
-# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In
-# this exercise, we ask you to train a logistic regression classifier using the
-# `penalty="l2"` regularization (which happens to be the default in
-# scikit-learn) to find by yourself the effect of the parameter `C`.
-#
-# We will start by loading the dataset.
-
-# %% [markdown]
-# ```{note}
-# If you want a deeper overview regarding this dataset, you can refer to the
-# Appendix - Datasets description section at the end of this MOOC.
-# ```
-
-# %%
-import pandas as pd
-
-penguins = pd.read_csv("../datasets/penguins_classification.csv")
-# only keep the Adelie and Chinstrap classes
-penguins = (
- penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
-)
-
-culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
-target_column = "Species"
-
-# %%
-from sklearn.model_selection import train_test_split
-
-penguins_train, penguins_test = train_test_split(penguins, random_state=0)
-
-data_train = penguins_train[culmen_columns]
-data_test = penguins_test[culmen_columns]
-
-target_train = penguins_train[target_column]
-target_test = penguins_test[target_column]
-
-# %% [markdown]
-# First, let's create our predictive model.
-
-# %%
-from sklearn.pipeline import make_pipeline
-from sklearn.preprocessing import StandardScaler
-from sklearn.linear_model import LogisticRegression
-
-logistic_regression = make_pipeline(
- StandardScaler(), LogisticRegression(penalty="l2")
-)
-
-# %% [markdown]
-# Given the following candidates for the `C` parameter, find out the impact of
-# `C` on the classifier decision boundary. You can use
-# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
-# decision function boundary.
-
-# %%
-Cs = [0.01, 0.1, 1, 10]
-
-# solution
-import matplotlib.pyplot as plt
-import seaborn as sns
-from sklearn.inspection import DecisionBoundaryDisplay
-
-for C in Cs:
- logistic_regression.set_params(logisticregression__C=C)
- logistic_regression.fit(data_train, target_train)
- accuracy = logistic_regression.score(data_test, target_test)
-
- DecisionBoundaryDisplay.from_estimator(
- logistic_regression,
- data_test,
- response_method="predict",
- cmap="RdBu_r",
- alpha=0.5,
- )
- sns.scatterplot(
- data=penguins_test,
- x=culmen_columns[0],
- y=culmen_columns[1],
- hue=target_column,
- palette=["tab:red", "tab:blue"],
- )
- plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
- plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")
-
-# %% [markdown]
-# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
-
-# %%
-# solution
-weights_ridge = []
-for C in Cs:
- logistic_regression.set_params(logisticregression__C=C)
- logistic_regression.fit(data_train, target_train)
- coefs = logistic_regression[-1].coef_[0]
- weights_ridge.append(pd.Series(coefs, index=culmen_columns))
-
-# %% tags=["solution"]
-weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f"C: {C}" for C in Cs])
-weights_ridge.plot.barh()
-_ = plt.title("LogisticRegression weights depending of C")
-
-# %% [markdown] tags=["solution"]
-# We see that a small `C` will shrink the weights values toward zero. It means
-# that a small `C` provides a more regularized model. Thus, `C` is the inverse
-# of the `alpha` coefficient in the `Ridge` model.
-#
-# Besides, with a strong penalty (i.e. small `C` value), the weight of the
-# feature "Culmen Depth (mm)" is almost zero. It explains why the decision
-# separation in the plot is almost perpendicular to the "Culmen Length (mm)"
-# feature.
diff --git a/python_scripts/logistic_regression.py b/python_scripts/logistic_regression.py
index 3156ebda0..45487341b 100644
--- a/python_scripts/logistic_regression.py
+++ b/python_scripts/logistic_regression.py
@@ -78,9 +78,7 @@
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
-logistic_regression = make_pipeline(
- StandardScaler(), LogisticRegression(penalty=None)
-)
+logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())
logistic_regression.fit(data_train, target_train)
accuracy = logistic_regression.score(data_test, target_test)
print(f"Accuracy on test set: {accuracy:.3f}")
@@ -124,8 +122,7 @@
# %% [markdown]
# Thus, we see that our decision function is represented by a line separating
-# the 2 classes. We should also note that we did not impose any regularization
-# by setting the parameter `penalty` to `'none'`.
+# the 2 classes.
#
# Since the line is oblique, it means that we used a combination of both
# features: