diff --git a/jupyter-book/_toc.yml b/jupyter-book/_toc.yml
index 5aea17ca2..80bb88aa3 100644
--- a/jupyter-book/_toc.yml
+++ b/jupyter-book/_toc.yml
@@ -99,11 +99,9 @@ parts:
- file: linear_models/linear_models_quiz_m4_02
- file: linear_models/linear_models_non_linear_index
sections:
+ - file: python_scripts/linear_regression_non_linear_link
- file: python_scripts/linear_models_ex_02
- file: python_scripts/linear_models_sol_02
- - file: python_scripts/linear_regression_non_linear_link
- - file: python_scripts/linear_models_ex_03
- - file: python_scripts/linear_models_sol_03
- file: python_scripts/logistic_regression_non_linear
- file: linear_models/linear_models_quiz_m4_03
- file: linear_models/linear_models_regularization_index
@@ -111,8 +109,8 @@ parts:
- file: linear_models/regularized_linear_models_slides
- file: python_scripts/linear_models_regularization
- file: linear_models/linear_models_quiz_m4_04
- - file: python_scripts/linear_models_ex_04
- - file: python_scripts/linear_models_sol_04
+ - file: python_scripts/linear_models_ex_03
+ - file: python_scripts/linear_models_sol_03
- file: linear_models/linear_models_quiz_m4_05
- file: linear_models/linear_models_wrap_up_quiz
- file: linear_models/linear_models_module_take_away
diff --git a/notebooks/linear_models_ex_02.ipynb b/notebooks/linear_models_ex_02.ipynb
index c9c0aad96..4cf750e81 100644
--- a/notebooks/linear_models_ex_02.ipynb
+++ b/notebooks/linear_models_ex_02.ipynb
@@ -4,39 +4,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# \ud83d\udcdd Exercise M4.02\n",
+ "# \ud83d\udcdd Exercise M4.03\n",
"\n",
- "The goal of this exercise is to build an intuition on what will be the\n",
- "parameters' values of a linear model when the link between the data and the\n",
- "target is non-linear.\n",
+ "In all previous notebooks, we only used a single feature in `data`. But we\n",
+ "have already shown that we could add new features to make the model more\n",
+ "expressive by deriving new features, based on the original feature.\n",
"\n",
- "First, we will generate such non-linear data.\n",
+ "The aim of this notebook is to train a linear regression algorithm on a\n",
+ "dataset with more than a single feature.\n",
"\n",
    - "<div class=\"admonition tip alert alert-warning\">\n",
    - "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
    - "<p class=\"last\"><tt class=\"docutils literal\">np.random.RandomState</tt> allows to create a random number generator which can\n",
    - "be later used to get deterministic results.</p>\n",
    - "</div>"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "\n",
- "# Set the seed for reproduction\n",
- "rng = np.random.RandomState(0)\n",
- "\n",
- "# Generate data\n",
- "n_sample = 100\n",
- "data_max, data_min = 1.4, -1.4\n",
- "len_data = data_max - data_min\n",
- "data = rng.rand(n_sample) * len_data - len_data / 2\n",
- "noise = rng.randn(n_sample) * 0.3\n",
- "target = data**3 - 0.5 * data**2 + noise"
+ "We will load a dataset about house prices in California. The dataset consists\n",
+ "of 8 features regarding the demography and geography of districts in\n",
+ "California and the aim is to predict the median house price of each district.\n",
+ "We will use all 8 features to predict the target, the median house price."
]
},
{
@@ -45,8 +25,8 @@
    "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    - "<p class=\"last\">To ease the plotting, we will create a Pandas dataframe containing the data\n",
    - "and target</p>\n",
    + "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    + "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
]
},
@@ -56,65 +36,19 @@
"metadata": {},
"outputs": [],
"source": [
- "import pandas as pd\n",
+ "from sklearn.datasets import fetch_california_housing\n",
"\n",
- "full_data = pd.DataFrame({\"data\": data, \"target\": target})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import seaborn as sns\n",
- "\n",
- "_ = sns.scatterplot(\n",
- " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "lines_to_next_cell": 2
- },
- "source": [
- "We observe that the link between the data `data` and vector `target` is\n",
- "non-linear. For instance, `data` could represent the years of experience\n",
- "(normalized) and `target` the salary (normalized). Therefore, the problem here\n",
- "would be to infer the salary given the years of experience.\n",
- "\n",
- "Using the function `f` defined below, find both the `weight` and the\n",
- "`intercept` that you think will lead to a good linear model. Plot both the\n",
- "data and the predictions of this model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "def f(data, weight=0, intercept=0):\n",
- " target_predict = weight * data + intercept\n",
- " return target_predict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
+ "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n",
+ "target *= 100 # rescale the target in k$\n",
+ "data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Compute the mean squared error for this model"
+ "Now it is your turn to train a linear regression model on this dataset. First,\n",
+ "create a linear regression model."
]
},
{
@@ -130,16 +64,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Train a linear regression model on this dataset.\n",
- "\n",
    - "<div class=\"admonition warning alert alert-danger\">\n",
    - "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Warning</p>\n",
    - "<p class=\"last\">In scikit-learn, by convention <tt class=\"docutils literal\">data</tt> (also called <tt class=\"docutils literal\">X</tt> in the scikit-learn\n",
    - "documentation) should be a 2D matrix of shape <tt class=\"docutils literal\">(n_samples, n_features)</tt>.\n",
    - "If <tt class=\"docutils literal\">data</tt> is a 1D vector, you need to reshape it into a matrix with a\n",
    - "single column if the vector represents a feature or a single row if the\n",
    - "vector represents a sample.</p>\n",
    - "</div>"
+ "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n",
+ "as metric. Be sure to *return* the fitted *estimators*."
]
},
{
@@ -148,8 +74,6 @@
"metadata": {},
"outputs": [],
"source": [
- "from sklearn.linear_model import LinearRegression\n",
- "\n",
"# Write your code here."
]
},
@@ -157,8 +81,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Compute predictions from the linear regression model and plot both the data\n",
- "and the predictions."
+ "Compute the mean and std of the MAE in thousands of dollars (k$)."
]
},
{
@@ -172,9 +95,15 @@
},
{
"cell_type": "markdown",
- "metadata": {},
+ "metadata": {
+ "lines_to_next_cell": 2
+ },
"source": [
- "Compute the mean squared error"
+ "Inspect the fitted model using a box plot to show the distribution of values\n",
+ "for the coefficients returned from the cross-validation. Hint: use the\n",
+ "function\n",
+ "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n",
+ "to create a box plot."
]
},
{
diff --git a/notebooks/linear_models_ex_03.ipynb b/notebooks/linear_models_ex_03.ipynb
deleted file mode 100644
index 4cf750e81..000000000
--- a/notebooks/linear_models_ex_03.ipynb
+++ /dev/null
@@ -1,130 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# \ud83d\udcdd Exercise M4.03\n",
- "\n",
- "In all previous notebooks, we only used a single feature in `data`. But we\n",
- "have already shown that we could add new features to make the model more\n",
- "expressive by deriving new features, based on the original feature.\n",
- "\n",
- "The aim of this notebook is to train a linear regression algorithm on a\n",
- "dataset with more than a single feature.\n",
- "\n",
- "We will load a dataset about house prices in California. The dataset consists\n",
- "of 8 features regarding the demography and geography of districts in\n",
- "California and the aim is to predict the median house price of each district.\n",
- "We will use all 8 features to predict the target, the median house price."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
    - "<div class=\"admonition note alert alert-info\">\n",
    - "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    - "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    - "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    - "</div>"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.datasets import fetch_california_housing\n",
- "\n",
- "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n",
- "target *= 100 # rescale the target in k$\n",
- "data.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now it is your turn to train a linear regression model on this dataset. First,\n",
- "create a linear regression model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n",
- "as metric. Be sure to *return* the fitted *estimators*."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Compute the mean and std of the MAE in thousands of dollars (k$)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "lines_to_next_cell": 2
- },
- "source": [
- "Inspect the fitted model using a box plot to show the distribution of values\n",
- "for the coefficients returned from the cross-validation. Hint: use the\n",
- "function\n",
- "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n",
- "to create a box plot."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write your code here."
- ]
- }
- ],
- "metadata": {
- "jupytext": {
- "main_language": "python"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
\ No newline at end of file
diff --git a/notebooks/linear_models_sol_02.ipynb b/notebooks/linear_models_sol_02.ipynb
index d56864c4e..634c43171 100644
--- a/notebooks/linear_models_sol_02.ipynb
+++ b/notebooks/linear_models_sol_02.ipynb
@@ -4,39 +4,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# \ud83d\udcc3 Solution for Exercise M4.02\n",
+ "# \ud83d\udcc3 Solution for Exercise M4.03\n",
"\n",
- "The goal of this exercise is to build an intuition on what will be the\n",
- "parameters' values of a linear model when the link between the data and the\n",
- "target is non-linear.\n",
+ "In all previous notebooks, we only used a single feature in `data`. But we\n",
+ "have already shown that we could add new features to make the model more\n",
+ "expressive by deriving new features, based on the original feature.\n",
"\n",
- "First, we will generate such non-linear data.\n",
+ "The aim of this notebook is to train a linear regression algorithm on a\n",
+ "dataset with more than a single feature.\n",
"\n",
    - "<div class=\"admonition tip alert alert-warning\">\n",
    - "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
    - "<p class=\"last\"><tt class=\"docutils literal\">np.random.RandomState</tt> allows to create a random number generator which can\n",
    - "be later used to get deterministic results.</p>\n",
    - "</div>"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "\n",
- "# Set the seed for reproduction\n",
- "rng = np.random.RandomState(0)\n",
- "\n",
- "# Generate data\n",
- "n_sample = 100\n",
- "data_max, data_min = 1.4, -1.4\n",
- "len_data = data_max - data_min\n",
- "data = rng.rand(n_sample) * len_data - len_data / 2\n",
- "noise = rng.randn(n_sample) * 0.3\n",
- "target = data**3 - 0.5 * data**2 + noise"
+ "We will load a dataset about house prices in California. The dataset consists\n",
+ "of 8 features regarding the demography and geography of districts in\n",
+ "California and the aim is to predict the median house price of each district.\n",
+ "We will use all 8 features to predict the target, the median house price."
]
},
{
@@ -45,8 +25,8 @@
    "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    - "<p class=\"last\">To ease the plotting, we will create a Pandas dataframe containing the data\n",
    - "and target</p>\n",
    + "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    + "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
]
},
@@ -56,49 +36,19 @@
"metadata": {},
"outputs": [],
"source": [
- "import pandas as pd\n",
- "\n",
- "full_data = pd.DataFrame({\"data\": data, \"target\": target})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import seaborn as sns\n",
+ "from sklearn.datasets import fetch_california_housing\n",
"\n",
- "_ = sns.scatterplot(\n",
- " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n",
- ")"
+ "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n",
+ "target *= 100 # rescale the target in k$\n",
+ "data.head()"
]
},
{
"cell_type": "markdown",
- "metadata": {
- "lines_to_next_cell": 2
- },
- "source": [
- "We observe that the link between the data `data` and vector `target` is\n",
- "non-linear. For instance, `data` could represent the years of experience\n",
- "(normalized) and `target` the salary (normalized). Therefore, the problem here\n",
- "would be to infer the salary given the years of experience.\n",
- "\n",
- "Using the function `f` defined below, find both the `weight` and the\n",
- "`intercept` that you think will lead to a good linear model. Plot both the\n",
- "data and the predictions of this model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
"metadata": {},
- "outputs": [],
"source": [
- "def f(data, weight=0, intercept=0):\n",
- " target_predict = weight * data + intercept\n",
- " return target_predict"
+ "Now it is your turn to train a linear regression model on this dataset. First,\n",
+ "create a linear regression model."
]
},
{
@@ -108,30 +58,17 @@
"outputs": [],
"source": [
"# solution\n",
- "predictions = f(data, weight=1.2, intercept=-0.2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "outputs": [],
- "source": [
- "ax = sns.scatterplot(\n",
- " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n",
- ")\n",
- "_ = ax.plot(data, predictions)"
+ "from sklearn.linear_model import LinearRegression\n",
+ "\n",
+ "linear_regression = LinearRegression()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Compute the mean squared error for this model"
+ "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n",
+ "as metric. Be sure to *return* the fitted *estimators*."
]
},
{
@@ -141,26 +78,24 @@
"outputs": [],
"source": [
"# solution\n",
- "from sklearn.metrics import mean_squared_error\n",
+ "from sklearn.model_selection import cross_validate\n",
"\n",
- "error = mean_squared_error(target, f(data, weight=1.2, intercept=-0.2))\n",
- "print(f\"The MSE is {error}\")"
+ "cv_results = cross_validate(\n",
+ " linear_regression,\n",
+ " data,\n",
+ " target,\n",
+ " scoring=\"neg_mean_absolute_error\",\n",
+ " return_estimator=True,\n",
+ " cv=10,\n",
+ " n_jobs=2,\n",
+ ")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Train a linear regression model on this dataset.\n",
- "\n",
    - "<div class=\"admonition warning alert alert-danger\">\n",
    - "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Warning</p>\n",
    - "<p class=\"last\">In scikit-learn, by convention <tt class=\"docutils literal\">data</tt> (also called <tt class=\"docutils literal\">X</tt> in the scikit-learn\n",
    - "documentation) should be a 2D matrix of shape <tt class=\"docutils literal\">(n_samples, n_features)</tt>.\n",
    - "If <tt class=\"docutils literal\">data</tt> is a 1D vector, you need to reshape it into a matrix with a\n",
    - "single column if the vector represents a feature or a single row if the\n",
    - "vector represents a sample.</p>\n",
    - "</div>"
+ "Compute the mean and std of the MAE in thousands of dollars (k$)."
]
},
{
@@ -169,20 +104,25 @@
"metadata": {},
"outputs": [],
"source": [
- "from sklearn.linear_model import LinearRegression\n",
- "\n",
"# solution\n",
- "linear_regression = LinearRegression()\n",
- "data_2d = data.reshape(-1, 1)\n",
- "linear_regression.fit(data_2d, target)"
+ "print(\n",
+ " \"Mean absolute error on testing set: \"\n",
+ " f\"{-cv_results['test_score'].mean():.3f} k$ \u00b1 \"\n",
+ " f\"{cv_results['test_score'].std():.3f}\"\n",
+ ")"
]
},
{
"cell_type": "markdown",
- "metadata": {},
+ "metadata": {
+ "lines_to_next_cell": 2
+ },
"source": [
- "Compute predictions from the linear regression model and plot both the data\n",
- "and the predictions."
+ "Inspect the fitted model using a box plot to show the distribution of values\n",
+ "for the coefficients returned from the cross-validation. Hint: use the\n",
+ "function\n",
+ "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n",
+ "to create a box plot."
]
},
{
@@ -192,7 +132,11 @@
"outputs": [],
"source": [
"# solution\n",
- "predictions = linear_regression.predict(data_2d)"
+ "import pandas as pd\n",
+ "\n",
+ "weights = pd.DataFrame(\n",
+ " [est.coef_ for est in cv_results[\"estimator\"]], columns=data.columns\n",
+ ")"
]
},
{
@@ -205,28 +149,11 @@
},
"outputs": [],
"source": [
- "ax = sns.scatterplot(\n",
- " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n",
- ")\n",
- "_ = ax.plot(data, predictions)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Compute the mean squared error"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "error = mean_squared_error(target, predictions)\n",
- "print(f\"The MSE is {error}\")"
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n",
+ "weights.plot.box(color=color, vert=False)\n",
+ "_ = plt.title(\"Value of linear regression coefficients\")"
]
}
],
diff --git a/notebooks/linear_models_sol_03.ipynb b/notebooks/linear_models_sol_03.ipynb
deleted file mode 100644
index 634c43171..000000000
--- a/notebooks/linear_models_sol_03.ipynb
+++ /dev/null
@@ -1,171 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# \ud83d\udcc3 Solution for Exercise M4.03\n",
- "\n",
- "In all previous notebooks, we only used a single feature in `data`. But we\n",
- "have already shown that we could add new features to make the model more\n",
- "expressive by deriving new features, based on the original feature.\n",
- "\n",
- "The aim of this notebook is to train a linear regression algorithm on a\n",
- "dataset with more than a single feature.\n",
- "\n",
- "We will load a dataset about house prices in California. The dataset consists\n",
- "of 8 features regarding the demography and geography of districts in\n",
- "California and the aim is to predict the median house price of each district.\n",
- "We will use all 8 features to predict the target, the median house price."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
    - "<div class=\"admonition note alert alert-info\">\n",
    - "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    - "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    - "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    - "</div>"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.datasets import fetch_california_housing\n",
- "\n",
- "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n",
- "target *= 100 # rescale the target in k$\n",
- "data.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now it is your turn to train a linear regression model on this dataset. First,\n",
- "create a linear regression model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "from sklearn.linear_model import LinearRegression\n",
- "\n",
- "linear_regression = LinearRegression()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n",
- "as metric. Be sure to *return* the fitted *estimators*."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "from sklearn.model_selection import cross_validate\n",
- "\n",
- "cv_results = cross_validate(\n",
- " linear_regression,\n",
- " data,\n",
- " target,\n",
- " scoring=\"neg_mean_absolute_error\",\n",
- " return_estimator=True,\n",
- " cv=10,\n",
- " n_jobs=2,\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Compute the mean and std of the MAE in thousands of dollars (k$)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "print(\n",
- " \"Mean absolute error on testing set: \"\n",
- " f\"{-cv_results['test_score'].mean():.3f} k$ \u00b1 \"\n",
- " f\"{cv_results['test_score'].std():.3f}\"\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "lines_to_next_cell": 2
- },
- "source": [
- "Inspect the fitted model using a box plot to show the distribution of values\n",
- "for the coefficients returned from the cross-validation. Hint: use the\n",
- "function\n",
- "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n",
- "to create a box plot."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# solution\n",
- "import pandas as pd\n",
- "\n",
- "weights = pd.DataFrame(\n",
- " [est.coef_ for est in cv_results[\"estimator\"]], columns=data.columns\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": [
- "solution"
- ]
- },
- "outputs": [],
- "source": [
- "import matplotlib.pyplot as plt\n",
- "\n",
- "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n",
- "weights.plot.box(color=color, vert=False)\n",
- "_ = plt.title(\"Value of linear regression coefficients\")"
- ]
- }
- ],
- "metadata": {
- "jupytext": {
- "main_language": "python"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
\ No newline at end of file
diff --git a/python_scripts/linear_models_ex_02.py b/python_scripts/linear_models_ex_02.py
index 640c44046..f58a1f0fe 100644
--- a/python_scripts/linear_models_ex_02.py
+++ b/python_scripts/linear_models_ex_02.py
@@ -14,100 +14,80 @@
# %% [markdown]
# # 📝 Exercise M4.02
#
-# The goal of this exercise is to build an intuition on what will be the
-# parameters' values of a linear model when the link between the data and the
-# target is non-linear.
+# In the previous notebook, we showed that we can add new features based on the
+# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`.
+# In that case we only used a single feature in `data`.
#
-# First, we will generate such non-linear data.
+# The aim of this notebook is to train a linear regression algorithm on a
+# dataset with more than a single feature. In such a "multi-dimensional" feature
+# space we can derive new features of the form `x1 * x2`, `x2 * x3`,
+# etc. Products of features are usually called "non-linear or
+# multiplicative interactions" between features.
#
-# ```{tip}
-# `np.random.RandomState` allows to create a random number generator which can
-# be later used to get deterministic results.
-# ```
-
-# %%
-import numpy as np
-
-# Set the seed for reproduction
-rng = np.random.RandomState(0)
-
-# Generate data
-n_sample = 100
-data_max, data_min = 1.4, -1.4
-len_data = data_max - data_min
-data = rng.rand(n_sample) * len_data - len_data / 2
-noise = rng.randn(n_sample) * 0.3
-target = data**3 - 0.5 * data**2 + noise
+# Feature engineering can be an important step of a model pipeline as long as
+# the new features are expected to be predictive. For instance, think of a
+# classification model that decides whether a patient is at risk of developing
+# heart disease. This would depend on the patient's Body Mass Index, which is
+# defined as `weight / height ** 2`.
+#
+# We load the penguins dataset. We first use a set of 3 numerical
+# features to predict the target, i.e. the body mass of the penguin.
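+#
+# As a purely illustrative aside (not part of the exercise itself), a derived
+# feature such as the Body Mass Index can be obtained with plain pandas
+# arithmetic. The toy dataframe and its column names below are made up for
+# this illustration.
+
+# %%
+# Hypothetical toy data, only to show how a derived feature can be added.
+import pandas as pd
+
+patients = pd.DataFrame({"weight_kg": [60.0, 85.0], "height_m": [1.65, 1.80]})
+patients["bmi"] = patients["weight_kg"] / patients["height_m"] ** 2
+patients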
# %% [markdown]
# ```{note}
-# To ease the plotting, we will create a Pandas dataframe containing the data
-# and target
+# If you want a deeper overview regarding this dataset, you can refer to the
+# Appendix - Datasets description section at the end of this MOOC.
# ```
# %%
import pandas as pd
-full_data = pd.DataFrame({"data": data, "target": target})
+penguins = pd.read_csv("../datasets/penguins.csv")
-# %%
-import seaborn as sns
+columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
+target_name = "Body Mass (g)"
-_ = sns.scatterplot(
- data=full_data, x="data", y="target", color="black", alpha=0.5
-)
+# Remove lines with missing values for the columns of interest
+penguins_non_missing = penguins[columns + [target_name]].dropna()
-# %% [markdown]
-# We observe that the link between the data `data` and vector `target` is
-# non-linear. For instance, `data` could represent the years of experience
-# (normalized) and `target` the salary (normalized). Therefore, the problem here
-# would be to infer the salary given the years of experience.
-#
-# Using the function `f` defined below, find both the `weight` and the
-# `intercept` that you think will lead to a good linear model. Plot both the
-# data and the predictions of this model.
-
-
-# %%
-def f(data, weight=0, intercept=0):
- target_predict = weight * data + intercept
- return target_predict
+data = penguins_non_missing[columns]
+target = penguins_non_missing[target_name]
+data.head()
+# %% [markdown]
+# Now it is your turn to train a linear regression model on this dataset. First,
+# create a linear regression model.
# %%
# Write your code here.
# %% [markdown]
-# Compute the mean squared error for this model
+# Execute a cross-validation with 10 folds and use the mean absolute error (MAE)
+# as metric.
# %%
# Write your code here.
# %% [markdown]
-# Train a linear regression model on this dataset.
-#
-# ```{warning}
-# In scikit-learn, by convention `data` (also called `X` in the scikit-learn
-# documentation) should be a 2D matrix of shape `(n_samples, n_features)`.
-# If `data` is a 1D vector, you need to reshape it into a matrix with a
-# single column if the vector represents a feature or a single row if the
-# vector represents a sample.
-# ```
+# Compute the mean and std of the MAE in grams (g).
# %%
-from sklearn.linear_model import LinearRegression
-
# Write your code here.
# %% [markdown]
-# Compute predictions from the linear regression model and plot both the data
-# and the predictions.
+# Now create a pipeline using `make_pipeline` consisting of a
+# `PolynomialFeatures` step and a linear regression. Pass `degree=2` and
+# `interaction_only=True` to the feature engineering step. Remember not to
+# include the bias to avoid redundancies with the linear regression's intercept.
+#
+# Use the same strategy as before to cross-validate such a pipeline.
# %%
# Write your code here.
# %% [markdown]
-# Compute the mean squared error
+# Compute the mean and std of the MAE in grams (g) and compare with the results
+# without feature engineering.
# %%
# Write your code here.
diff --git a/python_scripts/linear_models_ex_03.py b/python_scripts/linear_models_ex_03.py
index 3ab6949a3..9c311e817 100644
--- a/python_scripts/linear_models_ex_03.py
+++ b/python_scripts/linear_models_ex_03.py
@@ -14,24 +14,14 @@
# %% [markdown]
# # 📝 Exercise M4.03
#
-# In the previous notebook, we showed that we can add new features based on the
-# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`.
-# In that case we only used a single feature in `data`.
+# The parameter `penalty` can control the **type** of regularization to use,
+# whereas the regularization **strength** is set using the parameter `C`.
+# Setting `penalty="none"` is equivalent to an infinitely large value of `C`.
+# In this exercise, we ask you to train a logistic regression classifier using
+# the `penalty="l2"` regularization (which happens to be the default in
+# scikit-learn) to see for yourself the effect of the parameter `C`.
#
-# The aim of this notebook is to train a linear regression algorithm on a
-# dataset with more than a single feature. In such a "multi-dimensional" feature
-# space we can derive new features of the form `x1 * x2`, `x2 * x3`,
-# etc. Products of features are usually called "non-linear or
-# multiplicative interactions" between features.
-#
-# Feature engineering can be an important step of a model pipeline as long as
-# the new features are expected to be predictive. For instance, think of a
-# classification model to decide if a patient has risk of developing a heart
-# disease. This would depend on the patient's Body Mass Index which is defined
-# as `weight / height ** 2`.
-#
-# We load the dataset penguins dataset. We first use a set of 3 numerical
-# features to predict the target, i.e. the body mass of the penguin.
+# We start by loading the dataset.
# %% [markdown]
# ```{note}
@@ -42,52 +32,51 @@
# %%
import pandas as pd
-penguins = pd.read_csv("../datasets/penguins.csv")
+penguins = pd.read_csv("../datasets/penguins_classification.csv")
+# only keep the Adelie and Chinstrap classes
+penguins = (
+ penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
+)
-columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
-target_name = "Body Mass (g)"
+culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
+target_column = "Species"
-# Remove lines with missing values for the columns of interest
-penguins_non_missing = penguins[columns + [target_name]].dropna()
+# %%
+from sklearn.model_selection import train_test_split
-data = penguins_non_missing[columns]
-target = penguins_non_missing[target_name]
-data.head()
+penguins_train, penguins_test = train_test_split(penguins, random_state=0)
-# %% [markdown]
-# Now it is your turn to train a linear regression model on this dataset. First,
-# create a linear regression model.
+data_train = penguins_train[culmen_columns]
+data_test = penguins_test[culmen_columns]
-# %%
-# Write your code here.
+target_train = penguins_train[target_column]
+target_test = penguins_test[target_column]
# %% [markdown]
-# Execute a cross-validation with 10 folds and use the mean absolute error (MAE)
-# as metric.
+# First, let's create our predictive model.
# %%
-# Write your code here.
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import StandardScaler
+from sklearn.linear_model import LogisticRegression
-# %% [markdown]
-# Compute the mean and std of the MAE in grams (g).
-
-# %%
-# Write your code here.
+logistic_regression = make_pipeline(
+ StandardScaler(), LogisticRegression(penalty="l2")
+)
# %% [markdown]
-# Now create a pipeline using `make_pipeline` consisting of a
-# `PolynomialFeatures` and a linear regression. Set `degree=2` and
-# `interaction_only=True` to the feature engineering step. Remember not to
-# include the bias to avoid redundancies with the linear's regression intercept.
-#
-# Use the same strategy as before to cross-validate such a pipeline.
+# Given the following candidates for the `C` parameter, find out the impact of
+# `C` on the classifier decision boundary. You can use
+# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
+# decision function boundary.
# %%
+Cs = [0.01, 0.1, 1, 10]
+
# Write your code here.
# %% [markdown]
-# Compute the mean and std of the MAE in grams (g) and compare with the results
-# without feature engineering.
+# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
# %%
# Write your code here.
diff --git a/python_scripts/linear_models_ex_04.py b/python_scripts/linear_models_ex_04.py
deleted file mode 100644
index ef365713a..000000000
--- a/python_scripts/linear_models_ex_04.py
+++ /dev/null
@@ -1,82 +0,0 @@
-# ---
-# jupyter:
-# jupytext:
-# text_representation:
-# extension: .py
-# format_name: percent
-# format_version: '1.3'
-# jupytext_version: 1.14.5
-# kernelspec:
-# display_name: Python 3
-# name: python3
-# ---
-
-# %% [markdown]
-# # 📝 Exercise M4.04
-#
-# The parameter `penalty` can control the **type** of regularization to use,
-# whereas the regularization **strength** is set using the parameter `C`.
-# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In
-# this exercise, we ask you to train a logistic regression classifier using the
-# `penalty="l2"` regularization (which happens to be the default in
-# scikit-learn) to find by yourself the effect of the parameter `C`.
-#
-# We will start by loading the dataset.
-
-# %% [markdown]
-# ```{note}
-# If you want a deeper overview regarding this dataset, you can refer to the
-# Appendix - Datasets description section at the end of this MOOC.
-# ```
-
-# %%
-import pandas as pd
-
-penguins = pd.read_csv("../datasets/penguins_classification.csv")
-# only keep the Adelie and Chinstrap classes
-penguins = (
- penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
-)
-
-culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
-target_column = "Species"
-
-# %%
-from sklearn.model_selection import train_test_split
-
-penguins_train, penguins_test = train_test_split(penguins, random_state=0)
-
-data_train = penguins_train[culmen_columns]
-data_test = penguins_test[culmen_columns]
-
-target_train = penguins_train[target_column]
-target_test = penguins_test[target_column]
-
-# %% [markdown]
-# First, let's create our predictive model.
-
-# %%
-from sklearn.pipeline import make_pipeline
-from sklearn.preprocessing import StandardScaler
-from sklearn.linear_model import LogisticRegression
-
-logistic_regression = make_pipeline(
- StandardScaler(), LogisticRegression(penalty="l2")
-)
-
-# %% [markdown]
-# Given the following candidates for the `C` parameter, find out the impact of
-# `C` on the classifier decision boundary. You can use
-# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
-# decision function boundary.
-
-# %%
-Cs = [0.01, 0.1, 1, 10]
-
-# Write your code here.
-
-# %% [markdown]
-# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
-
-# %%
-# Write your code here.
diff --git a/python_scripts/linear_models_sol_02.py b/python_scripts/linear_models_sol_02.py
index d62a4b983..3abc476da 100644
--- a/python_scripts/linear_models_sol_02.py
+++ b/python_scripts/linear_models_sol_02.py
@@ -8,123 +8,127 @@
# %% [markdown]
# # 📃 Solution for Exercise M4.02
#
-# The goal of this exercise is to build an intuition on what will be the
-# parameters' values of a linear model when the link between the data and the
-# target is non-linear.
+# In the previous notebook, we showed that we can add new features based on the
+# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`.
+# In that case we only used a single feature in `data`.
#
-# First, we will generate such non-linear data.
+# The aim of this notebook is to train a linear regression algorithm on a
+# dataset with more than a single feature. In such a "multi-dimensional" feature
+# space we can derive new features of the form `x1 * x2`, `x2 * x3`,
+# etc. Products of features are usually called "non-linear or
+# multiplicative interactions" between features.
#
-# ```{tip}
-# `np.random.RandomState` allows to create a random number generator which can
-# be later used to get deterministic results.
-# ```
-
-# %%
-import numpy as np
-
-# Set the seed for reproduction
-rng = np.random.RandomState(0)
-
-# Generate data
-n_sample = 100
-data_max, data_min = 1.4, -1.4
-len_data = data_max - data_min
-data = rng.rand(n_sample) * len_data - len_data / 2
-noise = rng.randn(n_sample) * 0.3
-target = data**3 - 0.5 * data**2 + noise
+# Feature engineering can be an important step of a model pipeline as long as
+# the new features are expected to be predictive. For instance, think of a
+# classification model that decides whether a patient is at risk of developing
+# heart disease. This would depend on the patient's Body Mass Index, which is
+# defined as `weight / height ** 2`.
+#
+# We load the penguins dataset. We first use a set of 3 numerical
+# features to predict the target, i.e. the body mass of the penguin.
# %% [markdown]
# ```{note}
-# To ease the plotting, we will create a Pandas dataframe containing the data
-# and target
+# If you want a deeper overview regarding this dataset, you can refer to the
+# Appendix - Datasets description section at the end of this MOOC.
# ```
# %%
import pandas as pd
-full_data = pd.DataFrame({"data": data, "target": target})
-
-# %%
-import seaborn as sns
-
-_ = sns.scatterplot(
- data=full_data, x="data", y="target", color="black", alpha=0.5
-)
+penguins = pd.read_csv("../datasets/penguins.csv")
-# %% [markdown]
-# We observe that the link between the data `data` and vector `target` is
-# non-linear. For instance, `data` could represent the years of experience
-# (normalized) and `target` the salary (normalized). Therefore, the problem here
-# would be to infer the salary given the years of experience.
-#
-# Using the function `f` defined below, find both the `weight` and the
-# `intercept` that you think will lead to a good linear model. Plot both the
-# data and the predictions of this model.
+columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
+target_name = "Body Mass (g)"
+# Remove lines with missing values for the columns of interest
+penguins_non_missing = penguins[columns + [target_name]].dropna()
-# %%
-def f(data, weight=0, intercept=0):
- target_predict = weight * data + intercept
- return target_predict
+data = penguins_non_missing[columns]
+target = penguins_non_missing[target_name]
+data.head()
+# %% [markdown]
+# Now it is your turn to train a linear regression model on this dataset. First,
+# create a linear regression model.
# %%
# solution
-predictions = f(data, weight=1.2, intercept=-0.2)
+from sklearn.linear_model import LinearRegression
-# %% tags=["solution"]
-ax = sns.scatterplot(
- data=full_data, x="data", y="target", color="black", alpha=0.5
-)
-_ = ax.plot(data, predictions)
+linear_regression = LinearRegression()
# %% [markdown]
-# Compute the mean squared error for this model
+# Execute a cross-validation with 10 folds and use the mean absolute error (MAE)
+# as metric.
# %%
# solution
-from sklearn.metrics import mean_squared_error
-
-error = mean_squared_error(target, f(data, weight=1.2, intercept=-0.2))
-print(f"The MSE is {error}")
+from sklearn.model_selection import cross_validate
+
+cv_results = cross_validate(
+ linear_regression,
+ data,
+ target,
+ cv=10,
+ scoring="neg_mean_absolute_error",
+ n_jobs=2,
+)
# %% [markdown]
-# Train a linear regression model on this dataset.
-#
-# ```{warning}
-# In scikit-learn, by convention `data` (also called `X` in the scikit-learn
-# documentation) should be a 2D matrix of shape `(n_samples, n_features)`.
-# If `data` is a 1D vector, you need to reshape it into a matrix with a
-# single column if the vector represents a feature or a single row if the
-# vector represents a sample.
-# ```
+# Compute the mean and std of the MAE in grams (g).
# %%
-from sklearn.linear_model import LinearRegression
-
# solution
-linear_regression = LinearRegression()
-data_2d = data.reshape(-1, 1)
-linear_regression.fit(data_2d, target)
+print(
+ "Mean absolute error on testing set with original features: "
+ f"{-cv_results['test_score'].mean():.3f} ± "
+ f"{cv_results['test_score'].std():.3f} g"
+)
# %% [markdown]
-# Compute predictions from the linear regression model and plot both the data
-# and the predictions.
+# Now create a pipeline using `make_pipeline` consisting of a
+# `PolynomialFeatures` step and a linear regression. Pass `degree=2` and
+# `interaction_only=True` to the feature engineering step. Remember not to
+# include the bias to avoid redundancies with the linear regression's intercept.
+#
+# Use the same strategy as before to cross-validate such a pipeline.
# %%
# solution
-predictions = linear_regression.predict(data_2d)
+from sklearn.preprocessing import PolynomialFeatures
+from sklearn.pipeline import make_pipeline
-# %% tags=["solution"]
-ax = sns.scatterplot(
- data=full_data, x="data", y="target", color="black", alpha=0.5
+poly_features = PolynomialFeatures(
+ degree=2, include_bias=False, interaction_only=True
+)
+linear_regression_interactions = make_pipeline(
+ poly_features, linear_regression
+)
+
+cv_results = cross_validate(
+ linear_regression_interactions,
+ data,
+ target,
+ cv=10,
+ scoring="neg_mean_absolute_error",
+ n_jobs=2,
)
-_ = ax.plot(data, predictions)
# %% [markdown]
-# Compute the mean squared error
+# Compute the mean and std of the MAE in grams (g) and compare with the results
+# without feature engineering.
# %%
# solution
-error = mean_squared_error(target, predictions)
-print(f"The MSE is {error}")
+print(
+ "Mean absolute error on testing set with interactions: "
+ f"{-cv_results['test_score'].mean():.3f} ± "
+ f"{cv_results['test_score'].std():.3f} g"
+)
+
+# %% [markdown] tags=["solution"]
+# We observe that the mean absolute error is lower and less spread with the
+# enriched features. In this case the "interactions" are indeed predictive. In
+# the following notebook we will see what happens when the enriched features are
+# non-predictive and how to deal with this case.
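+
+# %% [markdown] tags=["solution"]
+# As a complementary check (not required by the exercise), the feature
+# engineering step can also be fitted on its own to list the generated
+# columns. This assumes a scikit-learn version where `PolynomialFeatures`
+# exposes `get_feature_names_out` (1.0 or later).
+
+# %% tags=["solution"]
+# Fit only the PolynomialFeatures step and inspect the generated feature
+# names: the 3 original columns plus their pairwise products.
+poly_features.fit(data)
+poly_features.get_feature_names_out()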
diff --git a/python_scripts/linear_models_sol_03.py b/python_scripts/linear_models_sol_03.py
index 0cacfcf0d..d789c8522 100644
--- a/python_scripts/linear_models_sol_03.py
+++ b/python_scripts/linear_models_sol_03.py
@@ -8,24 +8,14 @@
# %% [markdown]
# # 📃 Solution for Exercise M4.03
#
-# In the previous notebook, we showed that we can add new features based on the
-# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`.
-# In that case we only used a single feature in `data`.
+# The parameter `penalty` can control the **type** of regularization to use,
+# whereas the regularization **strength** is set using the parameter `C`.
+# Setting `penalty="none"` is equivalent to an infinitely large value of `C`.
+# In this exercise, we ask you to train a logistic regression classifier using
+# the `penalty="l2"` regularization (which happens to be the default in
+# scikit-learn) to see for yourself the effect of the parameter `C`.
#
-# The aim of this notebook is to train a linear regression algorithm on a
-# dataset with more than a single feature. In such a "multi-dimensional" feature
-# space we can derive new features of the form `x1 * x2`, `x2 * x3`,
-# etc. Products of features are usually called "non-linear or
-# multiplicative interactions" between features.
-#
-# Feature engineering can be an important step of a model pipeline as long as
-# the new features are expected to be predictive. For instance, think of a
-# classification model to decide if a patient has risk of developing a heart
-# disease. This would depend on the patient's Body Mass Index which is defined
-# as `weight / height ** 2`.
-#
-# We load the dataset penguins dataset. We first use a set of 3 numerical
-# features to predict the target, i.e. the body mass of the penguin.
+# We start by loading the dataset.
# %% [markdown]
# ```{note}
@@ -36,99 +26,97 @@
# %%
import pandas as pd
-penguins = pd.read_csv("../datasets/penguins.csv")
-
-columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
-target_name = "Body Mass (g)"
-
-# Remove lines with missing values for the columns of interest
-penguins_non_missing = penguins[columns + [target_name]].dropna()
-
-data = penguins_non_missing[columns]
-target = penguins_non_missing[target_name]
-data.head()
+penguins = pd.read_csv("../datasets/penguins_classification.csv")
+# only keep the Adelie and Chinstrap classes
+penguins = (
+ penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
+)
-# %% [markdown]
-# Now it is your turn to train a linear regression model on this dataset. First,
-# create a linear regression model.
+culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
+target_column = "Species"
# %%
-# solution
-from sklearn.linear_model import LinearRegression
+from sklearn.model_selection import train_test_split
-linear_regression = LinearRegression()
+penguins_train, penguins_test = train_test_split(penguins, random_state=0)
-# %% [markdown]
-# Execute a cross-validation with 10 folds and use the mean absolute error (MAE)
-# as metric.
+data_train = penguins_train[culmen_columns]
+data_test = penguins_test[culmen_columns]
-# %%
-# solution
-from sklearn.model_selection import cross_validate
-
-cv_results = cross_validate(
- linear_regression,
- data,
- target,
- cv=10,
- scoring="neg_mean_absolute_error",
- n_jobs=2,
-)
+target_train = penguins_train[target_column]
+target_test = penguins_test[target_column]
# %% [markdown]
-# Compute the mean and std of the MAE in grams (g).
+# First, let's create our predictive model.
# %%
-# solution
-print(
- "Mean absolute error on testing set with original features: "
- f"{-cv_results['test_score'].mean():.3f} ± "
- f"{cv_results['test_score'].std():.3f} g"
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import StandardScaler
+from sklearn.linear_model import LogisticRegression
+
+logistic_regression = make_pipeline(
+ StandardScaler(), LogisticRegression(penalty="l2")
)
# %% [markdown]
-# Now create a pipeline using `make_pipeline` consisting of a
-# `PolynomialFeatures` and a linear regression. Set `degree=2` and
-# `interaction_only=True` to the feature engineering step. Remember not to
-# include the bias to avoid redundancies with the linear's regression intercept.
-#
-# Use the same strategy as before to cross-validate such a pipeline.
+# Given the following candidates for the `C` parameter, find out the impact of
+# `C` on the classifier decision boundary. You can use
+# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
+# decision function boundary.
# %%
-# solution
-from sklearn.preprocessing import PolynomialFeatures
-from sklearn.pipeline import make_pipeline
-
-poly_features = PolynomialFeatures(
- degree=2, include_bias=False, interaction_only=True
-)
-linear_regression_interactions = make_pipeline(
- poly_features, linear_regression
-)
+Cs = [0.01, 0.1, 1, 10]
-cv_results = cross_validate(
- linear_regression_interactions,
- data,
- target,
- cv=10,
- scoring="neg_mean_absolute_error",
- n_jobs=2,
-)
+# solution
+import matplotlib.pyplot as plt
+import seaborn as sns
+from sklearn.inspection import DecisionBoundaryDisplay
+
+for C in Cs:
+ logistic_regression.set_params(logisticregression__C=C)
+ logistic_regression.fit(data_train, target_train)
+ accuracy = logistic_regression.score(data_test, target_test)
+
+ DecisionBoundaryDisplay.from_estimator(
+ logistic_regression,
+ data_test,
+ response_method="predict",
+ cmap="RdBu_r",
+ alpha=0.5,
+ )
+ sns.scatterplot(
+ data=penguins_test,
+ x=culmen_columns[0],
+ y=culmen_columns[1],
+ hue=target_column,
+ palette=["tab:red", "tab:blue"],
+ )
+ plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
+ plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")
# %% [markdown]
-# Compute the mean and std of the MAE in grams (g) and compare with the results
-# without feature engineering.
+# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
# %%
# solution
-print(
- "Mean absolute error on testing set with interactions: "
- f"{-cv_results['test_score'].mean():.3f} ± "
- f"{cv_results['test_score'].std():.3f} g"
-)
+weights_ridge = []
+for C in Cs:
+ logistic_regression.set_params(logisticregression__C=C)
+ logistic_regression.fit(data_train, target_train)
+ coefs = logistic_regression[-1].coef_[0]
+ weights_ridge.append(pd.Series(coefs, index=culmen_columns))
+
+# %% tags=["solution"]
+weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f"C: {C}" for C in Cs])
+weights_ridge.plot.barh()
+_ = plt.title("LogisticRegression weights depending on C")
# %% [markdown] tags=["solution"]
-# We observe that the mean absolute error is lower and less spread with the
-# enriched features. In this case the "interactions" are indeed predictive. In
-# the following notebook we will see what happens when the enriched features are
-# non-predictive and how to deal with this case.
+# We see that a small `C` shrinks the weight values toward zero. This means
+# that a small `C` provides a more regularized model. Thus, `C` behaves as the
+# inverse of the `alpha` coefficient in the `Ridge` model.
+#
+# Besides, with a strong penalty (i.e. a small `C` value), the weight of the
+# feature "Culmen Depth (mm)" is almost zero. This explains why the decision
+# boundary in the plot is almost perpendicular to the "Culmen Length (mm)"
+# axis.
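+
+# %% [markdown] tags=["solution"]
+# As a complementary sketch (not in the original solution), the shrinkage can
+# also be quantified by looking at the norm of the fitted weights for each
+# candidate `C`.
+
+# %% tags=["solution"]
+import numpy as np
+
+for C in Cs:
+    logistic_regression.set_params(logisticregression__C=C)
+    logistic_regression.fit(data_train, target_train)
+    # smaller C (stronger penalty) -> weights closer to zero
+    coef_norm = np.linalg.norm(logistic_regression[-1].coef_)
+    print(f"C={C:<5} -> norm of the weights: {coef_norm:.3f}")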
diff --git a/python_scripts/linear_models_sol_04.py b/python_scripts/linear_models_sol_04.py
deleted file mode 100644
index 358abce52..000000000
--- a/python_scripts/linear_models_sol_04.py
+++ /dev/null
@@ -1,122 +0,0 @@
-# ---
-# jupyter:
-# kernelspec:
-# display_name: Python 3
-# name: python3
-# ---
-
-# %% [markdown]
-# # 📃 Solution for Exercise M4.04
-#
-# The parameter `penalty` can control the **type** of regularization to use,
-# whereas the regularization **strength** is set using the parameter `C`.
-# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In
-# this exercise, we ask you to train a logistic regression classifier using the
-# `penalty="l2"` regularization (which happens to be the default in
-# scikit-learn) to find by yourself the effect of the parameter `C`.
-#
-# We start by loading the dataset.
-
-# %% [markdown]
-# ```{note}
-# If you want a deeper overview regarding this dataset, you can refer to the
-# Appendix - Datasets description section at the end of this MOOC.
-# ```
-
-# %%
-import pandas as pd
-
-penguins = pd.read_csv("../datasets/penguins_classification.csv")
-# only keep the Adelie and Chinstrap classes
-penguins = (
- penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
-)
-
-culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
-target_column = "Species"
-
-# %%
-from sklearn.model_selection import train_test_split
-
-penguins_train, penguins_test = train_test_split(penguins, random_state=0)
-
-data_train = penguins_train[culmen_columns]
-data_test = penguins_test[culmen_columns]
-
-target_train = penguins_train[target_column]
-target_test = penguins_test[target_column]
-
-# %% [markdown]
-# First, let's create our predictive model.
-
-# %%
-from sklearn.pipeline import make_pipeline
-from sklearn.preprocessing import StandardScaler
-from sklearn.linear_model import LogisticRegression
-
-logistic_regression = make_pipeline(
- StandardScaler(), LogisticRegression(penalty="l2")
-)
-
-# %% [markdown]
-# Given the following candidates for the `C` parameter, find out the impact of
-# `C` on the classifier decision boundary. You can use
-# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
-# decision function boundary.
-
-# %%
-Cs = [0.01, 0.1, 1, 10]
-
-# solution
-import matplotlib.pyplot as plt
-import seaborn as sns
-from sklearn.inspection import DecisionBoundaryDisplay
-
-for C in Cs:
- logistic_regression.set_params(logisticregression__C=C)
- logistic_regression.fit(data_train, target_train)
- accuracy = logistic_regression.score(data_test, target_test)
-
- DecisionBoundaryDisplay.from_estimator(
- logistic_regression,
- data_test,
- response_method="predict",
- cmap="RdBu_r",
- alpha=0.5,
- )
- sns.scatterplot(
- data=penguins_test,
- x=culmen_columns[0],
- y=culmen_columns[1],
- hue=target_column,
- palette=["tab:red", "tab:blue"],
- )
- plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
- plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")
-
-# %% [markdown]
-# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
-
-# %%
-# solution
-weights_ridge = []
-for C in Cs:
- logistic_regression.set_params(logisticregression__C=C)
- logistic_regression.fit(data_train, target_train)
- coefs = logistic_regression[-1].coef_[0]
- weights_ridge.append(pd.Series(coefs, index=culmen_columns))
-
-# %% tags=["solution"]
-weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f"C: {C}" for C in Cs])
-weights_ridge.plot.barh()
-_ = plt.title("LogisticRegression weights depending of C")
-
-# %% [markdown] tags=["solution"]
-# We see that a small `C` will shrink the weights values toward zero. It means
-# that a small `C` provides a more regularized model. Thus, `C` is the inverse
-# of the `alpha` coefficient in the `Ridge` model.
-#
-# Besides, with a strong penalty (i.e. small `C` value), the weight of the
-# feature "Culmen Depth (mm)" is almost zero. It explains why the decision
-# separation in the plot is almost perpendicular to the "Culmen Length (mm)"
-# feature.