diff --git a/jupyter-book/_toc.yml b/jupyter-book/_toc.yml index 5aea17ca2..80bb88aa3 100644 --- a/jupyter-book/_toc.yml +++ b/jupyter-book/_toc.yml @@ -99,11 +99,9 @@ parts: - file: linear_models/linear_models_quiz_m4_02 - file: linear_models/linear_models_non_linear_index sections: + - file: python_scripts/linear_regression_non_linear_link - file: python_scripts/linear_models_ex_02 - file: python_scripts/linear_models_sol_02 - - file: python_scripts/linear_regression_non_linear_link - - file: python_scripts/linear_models_ex_03 - - file: python_scripts/linear_models_sol_03 - file: python_scripts/logistic_regression_non_linear - file: linear_models/linear_models_quiz_m4_03 - file: linear_models/linear_models_regularization_index @@ -111,8 +109,8 @@ parts: - file: linear_models/regularized_linear_models_slides - file: python_scripts/linear_models_regularization - file: linear_models/linear_models_quiz_m4_04 - - file: python_scripts/linear_models_ex_04 - - file: python_scripts/linear_models_sol_04 + - file: python_scripts/linear_models_ex_03 + - file: python_scripts/linear_models_sol_03 - file: linear_models/linear_models_quiz_m4_05 - file: linear_models/linear_models_wrap_up_quiz - file: linear_models/linear_models_module_take_away diff --git a/notebooks/linear_models_ex_02.ipynb b/notebooks/linear_models_ex_02.ipynb index c9c0aad96..4cf750e81 100644 --- a/notebooks/linear_models_ex_02.ipynb +++ b/notebooks/linear_models_ex_02.ipynb @@ -4,39 +4,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# \ud83d\udcdd Exercise M4.02\n", + "# \ud83d\udcdd Exercise M4.03\n", "\n", - "The goal of this exercise is to build an intuition on what will be the\n", - "parameters' values of a linear model when the link between the data and the\n", - "target is non-linear.\n", + "In all previous notebooks, we only used a single feature in `data`. But we\n", + "have already shown that we could add new features to make the model more\n", + "expressive by deriving new features, based on the original feature.\n", "\n", - "First, we will generate such non-linear data.\n", + "The aim of this notebook is to train a linear regression algorithm on a\n", + "dataset with more than a single feature.\n", "\n", - "
\n", - "

Tip

\n", - "

np.random.RandomState allows to create a random number generator which can\n", - "be later used to get deterministic results.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "# Set the seed for reproduction\n", - "rng = np.random.RandomState(0)\n", - "\n", - "# Generate data\n", - "n_sample = 100\n", - "data_max, data_min = 1.4, -1.4\n", - "len_data = data_max - data_min\n", - "data = rng.rand(n_sample) * len_data - len_data / 2\n", - "noise = rng.randn(n_sample) * 0.3\n", - "target = data**3 - 0.5 * data**2 + noise" + "We will load a dataset about house prices in California. The dataset consists\n", + "of 8 features regarding the demography and geography of districts in\n", + "California and the aim is to predict the median house price of each district.\n", + "We will use all 8 features to predict the target, the median house price." ] }, { @@ -45,8 +25,8 @@ "source": [ "
\n", "

Note

\n", - "

To ease the plotting, we will create a Pandas dataframe containing the data\n", - "and target

\n", + "

If you want a deeper overview regarding this dataset, you can refer to the\n", + "Appendix - Datasets description section at the end of this MOOC.

\n", "
" ] }, @@ -56,65 +36,19 @@ "metadata": {}, "outputs": [], "source": [ - "import pandas as pd\n", + "from sklearn.datasets import fetch_california_housing\n", "\n", - "full_data = pd.DataFrame({\"data\": data, \"target\": target})" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import seaborn as sns\n", - "\n", - "_ = sns.scatterplot(\n", - " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 2 - }, - "source": [ - "We observe that the link between the data `data` and vector `target` is\n", - "non-linear. For instance, `data` could represent the years of experience\n", - "(normalized) and `target` the salary (normalized). Therefore, the problem here\n", - "would be to infer the salary given the years of experience.\n", - "\n", - "Using the function `f` defined below, find both the `weight` and the\n", - "`intercept` that you think will lead to a good linear model. Plot both the\n", - "data and the predictions of this model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def f(data, weight=0, intercept=0):\n", - " target_predict = weight * data + intercept\n", - " return target_predict" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." + "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n", + "target *= 100 # rescale the target in k$\n", + "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Compute the mean squared error for this model" + "Now it is your turn to train a linear regression model on this dataset. First,\n", + "create a linear regression model." ] }, { @@ -130,16 +64,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Train a linear regression model on this dataset.\n", - "\n", - "
\n", - "

Warning

\n", - "

In scikit-learn, by convention data (also called X in the scikit-learn\n", - "documentation) should be a 2D matrix of shape (n_samples, n_features).\n", - "If data is a 1D vector, you need to reshape it into a matrix with a\n", - "single column if the vector represents a feature or a single row if the\n", - "vector represents a sample.

\n", - "
" + "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n", + "as metric. Be sure to *return* the fitted *estimators*." ] }, { @@ -148,8 +74,6 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn.linear_model import LinearRegression\n", - "\n", "# Write your code here." ] }, @@ -157,8 +81,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Compute predictions from the linear regression model and plot both the data\n", - "and the predictions." + "Compute the mean and std of the MAE in thousands of dollars (k$)." ] }, { @@ -172,9 +95,15 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "lines_to_next_cell": 2 + }, "source": [ - "Compute the mean squared error" + "Inspect the fitted model using a box plot to show the distribution of values\n", + "for the coefficients returned from the cross-validation. Hint: use the\n", + "function\n", + "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n", + "to create a box plot." ] }, { diff --git a/notebooks/linear_models_ex_03.ipynb b/notebooks/linear_models_ex_03.ipynb deleted file mode 100644 index 4cf750e81..000000000 --- a/notebooks/linear_models_ex_03.ipynb +++ /dev/null @@ -1,130 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# \ud83d\udcdd Exercise M4.03\n", - "\n", - "In all previous notebooks, we only used a single feature in `data`. But we\n", - "have already shown that we could add new features to make the model more\n", - "expressive by deriving new features, based on the original feature.\n", - "\n", - "The aim of this notebook is to train a linear regression algorithm on a\n", - "dataset with more than a single feature.\n", - "\n", - "We will load a dataset about house prices in California. The dataset consists\n", - "of 8 features regarding the demography and geography of districts in\n", - "California and the aim is to predict the median house price of each district.\n", - "We will use all 8 features to predict the target, the median house price." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - "

Note

\n", - "

If you want a deeper overview regarding this dataset, you can refer to the\n", - "Appendix - Datasets description section at the end of this MOOC.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.datasets import fetch_california_housing\n", - "\n", - "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n", - "target *= 100 # rescale the target in k$\n", - "data.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now it is your turn to train a linear regression model on this dataset. First,\n", - "create a linear regression model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n", - "as metric. Be sure to *return* the fitted *estimators*." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Compute the mean and std of the MAE in thousands of dollars (k$)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 2 - }, - "source": [ - "Inspect the fitted model using a box plot to show the distribution of values\n", - "for the coefficients returned from the cross-validation. Hint: use the\n", - "function\n", - "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n", - "to create a box plot." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} \ No newline at end of file diff --git a/notebooks/linear_models_sol_02.ipynb b/notebooks/linear_models_sol_02.ipynb index d56864c4e..634c43171 100644 --- a/notebooks/linear_models_sol_02.ipynb +++ b/notebooks/linear_models_sol_02.ipynb @@ -4,39 +4,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# \ud83d\udcc3 Solution for Exercise M4.02\n", + "# \ud83d\udcc3 Solution for Exercise M4.03\n", "\n", - "The goal of this exercise is to build an intuition on what will be the\n", - "parameters' values of a linear model when the link between the data and the\n", - "target is non-linear.\n", + "In all previous notebooks, we only used a single feature in `data`. But we\n", + "have already shown that we could add new features to make the model more\n", + "expressive by deriving new features, based on the original feature.\n", "\n", - "First, we will generate such non-linear data.\n", + "The aim of this notebook is to train a linear regression algorithm on a\n", + "dataset with more than a single feature.\n", "\n", - "
\n", - "

Tip

\n", - "

np.random.RandomState allows to create a random number generator which can\n", - "be later used to get deterministic results.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "# Set the seed for reproduction\n", - "rng = np.random.RandomState(0)\n", - "\n", - "# Generate data\n", - "n_sample = 100\n", - "data_max, data_min = 1.4, -1.4\n", - "len_data = data_max - data_min\n", - "data = rng.rand(n_sample) * len_data - len_data / 2\n", - "noise = rng.randn(n_sample) * 0.3\n", - "target = data**3 - 0.5 * data**2 + noise" + "We will load a dataset about house prices in California. The dataset consists\n", + "of 8 features regarding the demography and geography of districts in\n", + "California and the aim is to predict the median house price of each district.\n", + "We will use all 8 features to predict the target, the median house price." ] }, { @@ -45,8 +25,8 @@ "source": [ "
\n", "

Note

\n", - "

To ease the plotting, we will create a Pandas dataframe containing the data\n", - "and target

\n", + "

If you want a deeper overview regarding this dataset, you can refer to the\n", + "Appendix - Datasets description section at the end of this MOOC.

\n", "
" ] }, @@ -56,49 +36,19 @@ "metadata": {}, "outputs": [], "source": [ - "import pandas as pd\n", - "\n", - "full_data = pd.DataFrame({\"data\": data, \"target\": target})" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import seaborn as sns\n", + "from sklearn.datasets import fetch_california_housing\n", "\n", - "_ = sns.scatterplot(\n", - " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n", - ")" + "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n", + "target *= 100 # rescale the target in k$\n", + "data.head()" ] }, { "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 2 - }, - "source": [ - "We observe that the link between the data `data` and vector `target` is\n", - "non-linear. For instance, `data` could represent the years of experience\n", - "(normalized) and `target` the salary (normalized). Therefore, the problem here\n", - "would be to infer the salary given the years of experience.\n", - "\n", - "Using the function `f` defined below, find both the `weight` and the\n", - "`intercept` that you think will lead to a good linear model. Plot both the\n", - "data and the predictions of this model." - ] - }, - { - "cell_type": "code", - "execution_count": null, "metadata": {}, - "outputs": [], "source": [ - "def f(data, weight=0, intercept=0):\n", - " target_predict = weight * data + intercept\n", - " return target_predict" + "Now it is your turn to train a linear regression model on this dataset. First,\n", + "create a linear regression model." ] }, { @@ -108,30 +58,17 @@ "outputs": [], "source": [ "# solution\n", - "predictions = f(data, weight=1.2, intercept=-0.2)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "solution" - ] - }, - "outputs": [], - "source": [ - "ax = sns.scatterplot(\n", - " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n", - ")\n", - "_ = ax.plot(data, predictions)" + "from sklearn.linear_model import LinearRegression\n", + "\n", + "linear_regression = LinearRegression()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Compute the mean squared error for this model" + "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n", + "as metric. Be sure to *return* the fitted *estimators*." ] }, { @@ -141,26 +78,24 @@ "outputs": [], "source": [ "# solution\n", - "from sklearn.metrics import mean_squared_error\n", + "from sklearn.model_selection import cross_validate\n", "\n", - "error = mean_squared_error(target, f(data, weight=1.2, intercept=-0.2))\n", - "print(f\"The MSE is {error}\")" + "cv_results = cross_validate(\n", + " linear_regression,\n", + " data,\n", + " target,\n", + " scoring=\"neg_mean_absolute_error\",\n", + " return_estimator=True,\n", + " cv=10,\n", + " n_jobs=2,\n", + ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Train a linear regression model on this dataset.\n", - "\n", - "
\n", - "

Warning

\n", - "

In scikit-learn, by convention data (also called X in the scikit-learn\n", - "documentation) should be a 2D matrix of shape (n_samples, n_features).\n", - "If data is a 1D vector, you need to reshape it into a matrix with a\n", - "single column if the vector represents a feature or a single row if the\n", - "vector represents a sample.

\n", - "
" + "Compute the mean and std of the MAE in thousands of dollars (k$)." ] }, { @@ -169,20 +104,25 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn.linear_model import LinearRegression\n", - "\n", "# solution\n", - "linear_regression = LinearRegression()\n", - "data_2d = data.reshape(-1, 1)\n", - "linear_regression.fit(data_2d, target)" + "print(\n", + " \"Mean absolute error on testing set: \"\n", + " f\"{-cv_results['test_score'].mean():.3f} k$ \u00b1 \"\n", + " f\"{cv_results['test_score'].std():.3f}\"\n", + ")" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "lines_to_next_cell": 2 + }, "source": [ - "Compute predictions from the linear regression model and plot both the data\n", - "and the predictions." + "Inspect the fitted model using a box plot to show the distribution of values\n", + "for the coefficients returned from the cross-validation. Hint: use the\n", + "function\n", + "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n", + "to create a box plot." ] }, { @@ -192,7 +132,11 @@ "outputs": [], "source": [ "# solution\n", - "predictions = linear_regression.predict(data_2d)" + "import pandas as pd\n", + "\n", + "weights = pd.DataFrame(\n", + " [est.coef_ for est in cv_results[\"estimator\"]], columns=data.columns\n", + ")" ] }, { @@ -205,28 +149,11 @@ }, "outputs": [], "source": [ - "ax = sns.scatterplot(\n", - " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n", - ")\n", - "_ = ax.plot(data, predictions)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Compute the mean squared error" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "error = mean_squared_error(target, predictions)\n", - "print(f\"The MSE is {error}\")" + "import matplotlib.pyplot as plt\n", + "\n", + "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n", + "weights.plot.box(color=color, vert=False)\n", + "_ = plt.title(\"Value of linear regression coefficients\")" ] } ], diff --git a/notebooks/linear_models_sol_03.ipynb b/notebooks/linear_models_sol_03.ipynb deleted file mode 100644 index 634c43171..000000000 --- a/notebooks/linear_models_sol_03.ipynb +++ /dev/null @@ -1,171 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# \ud83d\udcc3 Solution for Exercise M4.03\n", - "\n", - "In all previous notebooks, we only used a single feature in `data`. But we\n", - "have already shown that we could add new features to make the model more\n", - "expressive by deriving new features, based on the original feature.\n", - "\n", - "The aim of this notebook is to train a linear regression algorithm on a\n", - "dataset with more than a single feature.\n", - "\n", - "We will load a dataset about house prices in California. The dataset consists\n", - "of 8 features regarding the demography and geography of districts in\n", - "California and the aim is to predict the median house price of each district.\n", - "We will use all 8 features to predict the target, the median house price." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - "

Note

\n", - "

If you want a deeper overview regarding this dataset, you can refer to the\n", - "Appendix - Datasets description section at the end of this MOOC.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.datasets import fetch_california_housing\n", - "\n", - "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n", - "target *= 100 # rescale the target in k$\n", - "data.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now it is your turn to train a linear regression model on this dataset. First,\n", - "create a linear regression model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "from sklearn.linear_model import LinearRegression\n", - "\n", - "linear_regression = LinearRegression()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n", - "as metric. Be sure to *return* the fitted *estimators*." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "from sklearn.model_selection import cross_validate\n", - "\n", - "cv_results = cross_validate(\n", - " linear_regression,\n", - " data,\n", - " target,\n", - " scoring=\"neg_mean_absolute_error\",\n", - " return_estimator=True,\n", - " cv=10,\n", - " n_jobs=2,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Compute the mean and std of the MAE in thousands of dollars (k$)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "print(\n", - " \"Mean absolute error on testing set: \"\n", - " f\"{-cv_results['test_score'].mean():.3f} k$ \u00b1 \"\n", - " f\"{cv_results['test_score'].std():.3f}\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 2 - }, - "source": [ - "Inspect the fitted model using a box plot to show the distribution of values\n", - "for the coefficients returned from the cross-validation. Hint: use the\n", - "function\n", - "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n", - "to create a box plot." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "import pandas as pd\n", - "\n", - "weights = pd.DataFrame(\n", - " [est.coef_ for est in cv_results[\"estimator\"]], columns=data.columns\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "solution" - ] - }, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "\n", - "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n", - "weights.plot.box(color=color, vert=False)\n", - "_ = plt.title(\"Value of linear regression coefficients\")" - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} \ No newline at end of file diff --git a/python_scripts/linear_models_ex_02.py b/python_scripts/linear_models_ex_02.py index 640c44046..f58a1f0fe 100644 --- a/python_scripts/linear_models_ex_02.py +++ b/python_scripts/linear_models_ex_02.py @@ -14,100 +14,80 @@ # %% [markdown] # # 📝 Exercise M4.02 # -# The goal of this exercise is to build an intuition on what will be the -# parameters' values of a linear model when the link between the data and the -# target is non-linear. +# In the previous notebook, we showed that we can add new features based on the +# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`. +# In that case we only used a single feature in `data`. # -# First, we will generate such non-linear data. +# The aim of this notebook is to train a linear regression algorithm on a +# dataset with more than a single feature. In such a "multi-dimensional" feature +# space we can derive new features of the form `x1 * x2`, `x2 * x3`, +# etc. Products of features are usually called "non-linear or +# multiplicative interactions" between features. # -# ```{tip} -# `np.random.RandomState` allows to create a random number generator which can -# be later used to get deterministic results. -# ``` - -# %% -import numpy as np - -# Set the seed for reproduction -rng = np.random.RandomState(0) - -# Generate data -n_sample = 100 -data_max, data_min = 1.4, -1.4 -len_data = data_max - data_min -data = rng.rand(n_sample) * len_data - len_data / 2 -noise = rng.randn(n_sample) * 0.3 -target = data**3 - 0.5 * data**2 + noise +# Feature engineering can be an important step of a model pipeline as long as +# the new features are expected to be predictive. For instance, think of a +# classification model to decide if a patient has risk of developing a heart +# disease. This would depend on the patient's Body Mass Index which is defined +# as `weight / height ** 2`. +# +# We load the dataset penguins dataset. We first use a set of 3 numerical +# features to predict the target, i.e. the body mass of the penguin. # %% [markdown] # ```{note} -# To ease the plotting, we will create a Pandas dataframe containing the data -# and target +# If you want a deeper overview regarding this dataset, you can refer to the +# Appendix - Datasets description section at the end of this MOOC. 
# ``` # %% import pandas as pd -full_data = pd.DataFrame({"data": data, "target": target}) +penguins = pd.read_csv("../datasets/penguins.csv") -# %% -import seaborn as sns +columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"] +target_name = "Body Mass (g)" -_ = sns.scatterplot( - data=full_data, x="data", y="target", color="black", alpha=0.5 -) +# Remove lines with missing values for the columns of interest +penguins_non_missing = penguins[columns + [target_name]].dropna() -# %% [markdown] -# We observe that the link between the data `data` and vector `target` is -# non-linear. For instance, `data` could represent the years of experience -# (normalized) and `target` the salary (normalized). Therefore, the problem here -# would be to infer the salary given the years of experience. -# -# Using the function `f` defined below, find both the `weight` and the -# `intercept` that you think will lead to a good linear model. Plot both the -# data and the predictions of this model. - - -# %% -def f(data, weight=0, intercept=0): - target_predict = weight * data + intercept - return target_predict +data = penguins_non_missing[columns] +target = penguins_non_missing[target_name] +data.head() +# %% [markdown] +# Now it is your turn to train a linear regression model on this dataset. First, +# create a linear regression model. # %% # Write your code here. # %% [markdown] -# Compute the mean squared error for this model +# Execute a cross-validation with 10 folds and use the mean absolute error (MAE) +# as metric. # %% # Write your code here. # %% [markdown] -# Train a linear regression model on this dataset. -# -# ```{warning} -# In scikit-learn, by convention `data` (also called `X` in the scikit-learn -# documentation) should be a 2D matrix of shape `(n_samples, n_features)`. -# If `data` is a 1D vector, you need to reshape it into a matrix with a -# single column if the vector represents a feature or a single row if the -# vector represents a sample. -# ``` +# Compute the mean and std of the MAE in grams (g). # %% -from sklearn.linear_model import LinearRegression - # Write your code here. # %% [markdown] -# Compute predictions from the linear regression model and plot both the data -# and the predictions. +# Now create a pipeline using `make_pipeline` consisting of a +# `PolynomialFeatures` and a linear regression. Set `degree=2` and +# `interaction_only=True` to the feature engineering step. Remember not to +# include the bias to avoid redundancies with the linear's regression intercept. +# +# Use the same strategy as before to cross-validate such a pipeline. # %% # Write your code here. # %% [markdown] -# Compute the mean squared error +# Compute the mean and std of the MAE in grams (g) and compare with the results +# without feature engineering. # %% # Write your code here. diff --git a/python_scripts/linear_models_ex_03.py b/python_scripts/linear_models_ex_03.py index 3ab6949a3..9c311e817 100644 --- a/python_scripts/linear_models_ex_03.py +++ b/python_scripts/linear_models_ex_03.py @@ -14,24 +14,14 @@ # %% [markdown] # # 📝 Exercise M4.03 # -# In the previous notebook, we showed that we can add new features based on the -# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`. -# In that case we only used a single feature in `data`. +# The parameter `penalty` can control the **type** of regularization to use, +# whereas the regularization **strength** is set using the parameter `C`. 
+# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In +# this exercise, we ask you to train a logistic regression classifier using the +# `penalty="l2"` regularization (which happens to be the default in +# scikit-learn) to find by yourself the effect of the parameter `C`. # -# The aim of this notebook is to train a linear regression algorithm on a -# dataset with more than a single feature. In such a "multi-dimensional" feature -# space we can derive new features of the form `x1 * x2`, `x2 * x3`, -# etc. Products of features are usually called "non-linear or -# multiplicative interactions" between features. -# -# Feature engineering can be an important step of a model pipeline as long as -# the new features are expected to be predictive. For instance, think of a -# classification model to decide if a patient has risk of developing a heart -# disease. This would depend on the patient's Body Mass Index which is defined -# as `weight / height ** 2`. -# -# We load the dataset penguins dataset. We first use a set of 3 numerical -# features to predict the target, i.e. the body mass of the penguin. +# We start by loading the dataset. # %% [markdown] # ```{note} @@ -42,52 +32,51 @@ # %% import pandas as pd -penguins = pd.read_csv("../datasets/penguins.csv") +penguins = pd.read_csv("../datasets/penguins_classification.csv") +# only keep the Adelie and Chinstrap classes +penguins = ( + penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index() +) -columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"] -target_name = "Body Mass (g)" +culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"] +target_column = "Species" -# Remove lines with missing values for the columns of interest -penguins_non_missing = penguins[columns + [target_name]].dropna() +# %% +from sklearn.model_selection import train_test_split -data = penguins_non_missing[columns] -target = penguins_non_missing[target_name] -data.head() +penguins_train, penguins_test = train_test_split(penguins, random_state=0) -# %% [markdown] -# Now it is your turn to train a linear regression model on this dataset. First, -# create a linear regression model. +data_train = penguins_train[culmen_columns] +data_test = penguins_test[culmen_columns] -# %% -# Write your code here. +target_train = penguins_train[target_column] +target_test = penguins_test[target_column] # %% [markdown] -# Execute a cross-validation with 10 folds and use the mean absolute error (MAE) -# as metric. +# First, let's create our predictive model. # %% -# Write your code here. +from sklearn.pipeline import make_pipeline +from sklearn.preprocessing import StandardScaler +from sklearn.linear_model import LogisticRegression -# %% [markdown] -# Compute the mean and std of the MAE in grams (g). - -# %% -# Write your code here. +logistic_regression = make_pipeline( + StandardScaler(), LogisticRegression(penalty="l2") +) # %% [markdown] -# Now create a pipeline using `make_pipeline` consisting of a -# `PolynomialFeatures` and a linear regression. Set `degree=2` and -# `interaction_only=True` to the feature engineering step. Remember not to -# include the bias to avoid redundancies with the linear's regression intercept. -# -# Use the same strategy as before to cross-validate such a pipeline. +# Given the following candidates for the `C` parameter, find out the impact of +# `C` on the classifier decision boundary. 
You can use +# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the +# decision function boundary. # %% +Cs = [0.01, 0.1, 1, 10] + # Write your code here. # %% [markdown] -# Compute the mean and std of the MAE in grams (g) and compare with the results -# without feature engineering. +# Look at the impact of the `C` hyperparameter on the magnitude of the weights. # %% # Write your code here. diff --git a/python_scripts/linear_models_ex_04.py b/python_scripts/linear_models_ex_04.py deleted file mode 100644 index ef365713a..000000000 --- a/python_scripts/linear_models_ex_04.py +++ /dev/null @@ -1,82 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.14.5 -# kernelspec: -# display_name: Python 3 -# name: python3 -# --- - -# %% [markdown] -# # 📝 Exercise M4.04 -# -# The parameter `penalty` can control the **type** of regularization to use, -# whereas the regularization **strength** is set using the parameter `C`. -# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In -# this exercise, we ask you to train a logistic regression classifier using the -# `penalty="l2"` regularization (which happens to be the default in -# scikit-learn) to find by yourself the effect of the parameter `C`. -# -# We will start by loading the dataset. - -# %% [markdown] -# ```{note} -# If you want a deeper overview regarding this dataset, you can refer to the -# Appendix - Datasets description section at the end of this MOOC. -# ``` - -# %% -import pandas as pd - -penguins = pd.read_csv("../datasets/penguins_classification.csv") -# only keep the Adelie and Chinstrap classes -penguins = ( - penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index() -) - -culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"] -target_column = "Species" - -# %% -from sklearn.model_selection import train_test_split - -penguins_train, penguins_test = train_test_split(penguins, random_state=0) - -data_train = penguins_train[culmen_columns] -data_test = penguins_test[culmen_columns] - -target_train = penguins_train[target_column] -target_test = penguins_test[target_column] - -# %% [markdown] -# First, let's create our predictive model. - -# %% -from sklearn.pipeline import make_pipeline -from sklearn.preprocessing import StandardScaler -from sklearn.linear_model import LogisticRegression - -logistic_regression = make_pipeline( - StandardScaler(), LogisticRegression(penalty="l2") -) - -# %% [markdown] -# Given the following candidates for the `C` parameter, find out the impact of -# `C` on the classifier decision boundary. You can use -# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the -# decision function boundary. - -# %% -Cs = [0.01, 0.1, 1, 10] - -# Write your code here. - -# %% [markdown] -# Look at the impact of the `C` hyperparameter on the magnitude of the weights. - -# %% -# Write your code here. diff --git a/python_scripts/linear_models_sol_02.py b/python_scripts/linear_models_sol_02.py index d62a4b983..3abc476da 100644 --- a/python_scripts/linear_models_sol_02.py +++ b/python_scripts/linear_models_sol_02.py @@ -8,123 +8,127 @@ # %% [markdown] # # 📃 Solution for Exercise M4.02 # -# The goal of this exercise is to build an intuition on what will be the -# parameters' values of a linear model when the link between the data and the -# target is non-linear. 
+# In the previous notebook, we showed that we can add new features based on the +# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`. +# In that case we only used a single feature in `data`. # -# First, we will generate such non-linear data. +# The aim of this notebook is to train a linear regression algorithm on a +# dataset with more than a single feature. In such a "multi-dimensional" feature +# space we can derive new features of the form `x1 * x2`, `x2 * x3`, +# etc. Products of features are usually called "non-linear or +# multiplicative interactions" between features. # -# ```{tip} -# `np.random.RandomState` allows to create a random number generator which can -# be later used to get deterministic results. -# ``` - -# %% -import numpy as np - -# Set the seed for reproduction -rng = np.random.RandomState(0) - -# Generate data -n_sample = 100 -data_max, data_min = 1.4, -1.4 -len_data = data_max - data_min -data = rng.rand(n_sample) * len_data - len_data / 2 -noise = rng.randn(n_sample) * 0.3 -target = data**3 - 0.5 * data**2 + noise +# Feature engineering can be an important step of a model pipeline as long as +# the new features are expected to be predictive. For instance, think of a +# classification model to decide if a patient has risk of developing a heart +# disease. This would depend on the patient's Body Mass Index which is defined +# as `weight / height ** 2`. +# +# We load the dataset penguins dataset. We first use a set of 3 numerical +# features to predict the target, i.e. the body mass of the penguin. # %% [markdown] # ```{note} -# To ease the plotting, we will create a Pandas dataframe containing the data -# and target +# If you want a deeper overview regarding this dataset, you can refer to the +# Appendix - Datasets description section at the end of this MOOC. # ``` # %% import pandas as pd -full_data = pd.DataFrame({"data": data, "target": target}) - -# %% -import seaborn as sns - -_ = sns.scatterplot( - data=full_data, x="data", y="target", color="black", alpha=0.5 -) +penguins = pd.read_csv("../datasets/penguins.csv") -# %% [markdown] -# We observe that the link between the data `data` and vector `target` is -# non-linear. For instance, `data` could represent the years of experience -# (normalized) and `target` the salary (normalized). Therefore, the problem here -# would be to infer the salary given the years of experience. -# -# Using the function `f` defined below, find both the `weight` and the -# `intercept` that you think will lead to a good linear model. Plot both the -# data and the predictions of this model. +columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"] +target_name = "Body Mass (g)" +# Remove lines with missing values for the columns of interest +penguins_non_missing = penguins[columns + [target_name]].dropna() -# %% -def f(data, weight=0, intercept=0): - target_predict = weight * data + intercept - return target_predict +data = penguins_non_missing[columns] +target = penguins_non_missing[target_name] +data.head() +# %% [markdown] +# Now it is your turn to train a linear regression model on this dataset. First, +# create a linear regression model. 
# %% # solution -predictions = f(data, weight=1.2, intercept=-0.2) +from sklearn.linear_model import LinearRegression -# %% tags=["solution"] -ax = sns.scatterplot( - data=full_data, x="data", y="target", color="black", alpha=0.5 -) -_ = ax.plot(data, predictions) +linear_regression = LinearRegression() # %% [markdown] -# Compute the mean squared error for this model +# Execute a cross-validation with 10 folds and use the mean absolute error (MAE) +# as metric. # %% # solution -from sklearn.metrics import mean_squared_error - -error = mean_squared_error(target, f(data, weight=1.2, intercept=-0.2)) -print(f"The MSE is {error}") +from sklearn.model_selection import cross_validate + +cv_results = cross_validate( + linear_regression, + data, + target, + cv=10, + scoring="neg_mean_absolute_error", + n_jobs=2, +) # %% [markdown] -# Train a linear regression model on this dataset. -# -# ```{warning} -# In scikit-learn, by convention `data` (also called `X` in the scikit-learn -# documentation) should be a 2D matrix of shape `(n_samples, n_features)`. -# If `data` is a 1D vector, you need to reshape it into a matrix with a -# single column if the vector represents a feature or a single row if the -# vector represents a sample. -# ``` +# Compute the mean and std of the MAE in grams (g). # %% -from sklearn.linear_model import LinearRegression - # solution -linear_regression = LinearRegression() -data_2d = data.reshape(-1, 1) -linear_regression.fit(data_2d, target) +print( + "Mean absolute error on testing set with original features: " + f"{-cv_results['test_score'].mean():.3f} ± " + f"{cv_results['test_score'].std():.3f} g" +) # %% [markdown] -# Compute predictions from the linear regression model and plot both the data -# and the predictions. +# Now create a pipeline using `make_pipeline` consisting of a +# `PolynomialFeatures` and a linear regression. Set `degree=2` and +# `interaction_only=True` to the feature engineering step. Remember not to +# include the bias to avoid redundancies with the linear's regression intercept. +# +# Use the same strategy as before to cross-validate such a pipeline. # %% # solution -predictions = linear_regression.predict(data_2d) +from sklearn.preprocessing import PolynomialFeatures +from sklearn.pipeline import make_pipeline -# %% tags=["solution"] -ax = sns.scatterplot( - data=full_data, x="data", y="target", color="black", alpha=0.5 +poly_features = PolynomialFeatures( + degree=2, include_bias=False, interaction_only=True +) +linear_regression_interactions = make_pipeline( + poly_features, linear_regression +) + +cv_results = cross_validate( + linear_regression_interactions, + data, + target, + cv=10, + scoring="neg_mean_absolute_error", + n_jobs=2, ) -_ = ax.plot(data, predictions) # %% [markdown] -# Compute the mean squared error +# Compute the mean and std of the MAE in grams (g) and compare with the results +# without feature engineering. # %% # solution -error = mean_squared_error(target, predictions) -print(f"The MSE is {error}") +print( + "Mean absolute error on testing set with interactions: " + f"{-cv_results['test_score'].mean():.3f} ± " + f"{cv_results['test_score'].std():.3f} g" +) + +# %% [markdown] tags=["solution"] +# We observe that the mean absolute error is lower and less spread with the +# enriched features. In this case the "interactions" are indeed predictive. In +# the following notebook we will see what happens when the enriched features are +# non-predictive and how to deal with this case. 
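As an illustrative aside (not part of the patch itself): a quick way to see which columns the `PolynomialFeatures` step in `linear_models_sol_02.py` above actually generates is to fit it on its own and list the output feature names. The sketch below assumes the `data` dataframe with the three penguins columns defined earlier in that solution is in scope.

```python
from sklearn.preprocessing import PolynomialFeatures

# Same settings as in the solution: pairwise interactions only, no bias column
poly_features = PolynomialFeatures(
    degree=2, include_bias=False, interaction_only=True
)
poly_features.fit(data)

# With 3 input columns this lists the 3 original features followed by their
# 3 pairwise products, e.g. "Flipper Length (mm) Culmen Length (mm)"
print(poly_features.get_feature_names_out())
```

With three numerical features this yields six columns in total, which is what makes the pipeline with interactions more expressive than the plain linear regression.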
diff --git a/python_scripts/linear_models_sol_03.py b/python_scripts/linear_models_sol_03.py index 0cacfcf0d..d789c8522 100644 --- a/python_scripts/linear_models_sol_03.py +++ b/python_scripts/linear_models_sol_03.py @@ -8,24 +8,14 @@ # %% [markdown] # # 📃 Solution for Exercise M4.03 # -# In the previous notebook, we showed that we can add new features based on the -# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`. -# In that case we only used a single feature in `data`. +# The parameter `penalty` can control the **type** of regularization to use, +# whereas the regularization **strength** is set using the parameter `C`. +# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In +# this exercise, we ask you to train a logistic regression classifier using the +# `penalty="l2"` regularization (which happens to be the default in +# scikit-learn) to find by yourself the effect of the parameter `C`. # -# The aim of this notebook is to train a linear regression algorithm on a -# dataset with more than a single feature. In such a "multi-dimensional" feature -# space we can derive new features of the form `x1 * x2`, `x2 * x3`, -# etc. Products of features are usually called "non-linear or -# multiplicative interactions" between features. -# -# Feature engineering can be an important step of a model pipeline as long as -# the new features are expected to be predictive. For instance, think of a -# classification model to decide if a patient has risk of developing a heart -# disease. This would depend on the patient's Body Mass Index which is defined -# as `weight / height ** 2`. -# -# We load the dataset penguins dataset. We first use a set of 3 numerical -# features to predict the target, i.e. the body mass of the penguin. +# We start by loading the dataset. # %% [markdown] # ```{note} @@ -36,99 +26,97 @@ # %% import pandas as pd -penguins = pd.read_csv("../datasets/penguins.csv") - -columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"] -target_name = "Body Mass (g)" - -# Remove lines with missing values for the columns of interest -penguins_non_missing = penguins[columns + [target_name]].dropna() - -data = penguins_non_missing[columns] -target = penguins_non_missing[target_name] -data.head() +penguins = pd.read_csv("../datasets/penguins_classification.csv") +# only keep the Adelie and Chinstrap classes +penguins = ( + penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index() +) -# %% [markdown] -# Now it is your turn to train a linear regression model on this dataset. First, -# create a linear regression model. +culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"] +target_column = "Species" # %% -# solution -from sklearn.linear_model import LinearRegression +from sklearn.model_selection import train_test_split -linear_regression = LinearRegression() +penguins_train, penguins_test = train_test_split(penguins, random_state=0) -# %% [markdown] -# Execute a cross-validation with 10 folds and use the mean absolute error (MAE) -# as metric. +data_train = penguins_train[culmen_columns] +data_test = penguins_test[culmen_columns] -# %% -# solution -from sklearn.model_selection import cross_validate - -cv_results = cross_validate( - linear_regression, - data, - target, - cv=10, - scoring="neg_mean_absolute_error", - n_jobs=2, -) +target_train = penguins_train[target_column] +target_test = penguins_test[target_column] # %% [markdown] -# Compute the mean and std of the MAE in grams (g). 
+# First, let's create our predictive model. # %% -# solution -print( - "Mean absolute error on testing set with original features: " - f"{-cv_results['test_score'].mean():.3f} ± " - f"{cv_results['test_score'].std():.3f} g" +from sklearn.pipeline import make_pipeline +from sklearn.preprocessing import StandardScaler +from sklearn.linear_model import LogisticRegression + +logistic_regression = make_pipeline( + StandardScaler(), LogisticRegression(penalty="l2") ) # %% [markdown] -# Now create a pipeline using `make_pipeline` consisting of a -# `PolynomialFeatures` and a linear regression. Set `degree=2` and -# `interaction_only=True` to the feature engineering step. Remember not to -# include the bias to avoid redundancies with the linear's regression intercept. -# -# Use the same strategy as before to cross-validate such a pipeline. +# Given the following candidates for the `C` parameter, find out the impact of +# `C` on the classifier decision boundary. You can use +# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the +# decision function boundary. # %% -# solution -from sklearn.preprocessing import PolynomialFeatures -from sklearn.pipeline import make_pipeline - -poly_features = PolynomialFeatures( - degree=2, include_bias=False, interaction_only=True -) -linear_regression_interactions = make_pipeline( - poly_features, linear_regression -) +Cs = [0.01, 0.1, 1, 10] -cv_results = cross_validate( - linear_regression_interactions, - data, - target, - cv=10, - scoring="neg_mean_absolute_error", - n_jobs=2, -) +# solution +import matplotlib.pyplot as plt +import seaborn as sns +from sklearn.inspection import DecisionBoundaryDisplay + +for C in Cs: + logistic_regression.set_params(logisticregression__C=C) + logistic_regression.fit(data_train, target_train) + accuracy = logistic_regression.score(data_test, target_test) + + DecisionBoundaryDisplay.from_estimator( + logistic_regression, + data_test, + response_method="predict", + cmap="RdBu_r", + alpha=0.5, + ) + sns.scatterplot( + data=penguins_test, + x=culmen_columns[0], + y=culmen_columns[1], + hue=target_column, + palette=["tab:red", "tab:blue"], + ) + plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left") + plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}") # %% [markdown] -# Compute the mean and std of the MAE in grams (g) and compare with the results -# without feature engineering. +# Look at the impact of the `C` hyperparameter on the magnitude of the weights. # %% # solution -print( - "Mean absolute error on testing set with interactions: " - f"{-cv_results['test_score'].mean():.3f} ± " - f"{cv_results['test_score'].std():.3f} g" -) +weights_ridge = [] +for C in Cs: + logistic_regression.set_params(logisticregression__C=C) + logistic_regression.fit(data_train, target_train) + coefs = logistic_regression[-1].coef_[0] + weights_ridge.append(pd.Series(coefs, index=culmen_columns)) + +# %% tags=["solution"] +weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f"C: {C}" for C in Cs]) +weights_ridge.plot.barh() +_ = plt.title("LogisticRegression weights depending of C") # %% [markdown] tags=["solution"] -# We observe that the mean absolute error is lower and less spread with the -# enriched features. In this case the "interactions" are indeed predictive. In -# the following notebook we will see what happens when the enriched features are -# non-predictive and how to deal with this case. +# We see that a small `C` will shrink the weights values toward zero. 
It means +# that a small `C` provides a more regularized model. Thus, `C` is the inverse +# of the `alpha` coefficient in the `Ridge` model. +# +# Besides, with a strong penalty (i.e. small `C` value), the weight of the +# feature "Culmen Depth (mm)" is almost zero. It explains why the decision +# separation in the plot is almost perpendicular to the "Culmen Length (mm)" +# feature. diff --git a/python_scripts/linear_models_sol_04.py b/python_scripts/linear_models_sol_04.py deleted file mode 100644 index 358abce52..000000000 --- a/python_scripts/linear_models_sol_04.py +++ /dev/null @@ -1,122 +0,0 @@ -# --- -# jupyter: -# kernelspec: -# display_name: Python 3 -# name: python3 -# --- - -# %% [markdown] -# # 📃 Solution for Exercise M4.04 -# -# The parameter `penalty` can control the **type** of regularization to use, -# whereas the regularization **strength** is set using the parameter `C`. -# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In -# this exercise, we ask you to train a logistic regression classifier using the -# `penalty="l2"` regularization (which happens to be the default in -# scikit-learn) to find by yourself the effect of the parameter `C`. -# -# We start by loading the dataset. - -# %% [markdown] -# ```{note} -# If you want a deeper overview regarding this dataset, you can refer to the -# Appendix - Datasets description section at the end of this MOOC. -# ``` - -# %% -import pandas as pd - -penguins = pd.read_csv("../datasets/penguins_classification.csv") -# only keep the Adelie and Chinstrap classes -penguins = ( - penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index() -) - -culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"] -target_column = "Species" - -# %% -from sklearn.model_selection import train_test_split - -penguins_train, penguins_test = train_test_split(penguins, random_state=0) - -data_train = penguins_train[culmen_columns] -data_test = penguins_test[culmen_columns] - -target_train = penguins_train[target_column] -target_test = penguins_test[target_column] - -# %% [markdown] -# First, let's create our predictive model. - -# %% -from sklearn.pipeline import make_pipeline -from sklearn.preprocessing import StandardScaler -from sklearn.linear_model import LogisticRegression - -logistic_regression = make_pipeline( - StandardScaler(), LogisticRegression(penalty="l2") -) - -# %% [markdown] -# Given the following candidates for the `C` parameter, find out the impact of -# `C` on the classifier decision boundary. You can use -# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the -# decision function boundary. - -# %% -Cs = [0.01, 0.1, 1, 10] - -# solution -import matplotlib.pyplot as plt -import seaborn as sns -from sklearn.inspection import DecisionBoundaryDisplay - -for C in Cs: - logistic_regression.set_params(logisticregression__C=C) - logistic_regression.fit(data_train, target_train) - accuracy = logistic_regression.score(data_test, target_test) - - DecisionBoundaryDisplay.from_estimator( - logistic_regression, - data_test, - response_method="predict", - cmap="RdBu_r", - alpha=0.5, - ) - sns.scatterplot( - data=penguins_test, - x=culmen_columns[0], - y=culmen_columns[1], - hue=target_column, - palette=["tab:red", "tab:blue"], - ) - plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left") - plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}") - -# %% [markdown] -# Look at the impact of the `C` hyperparameter on the magnitude of the weights. 
- -# %% -# solution -weights_ridge = [] -for C in Cs: - logistic_regression.set_params(logisticregression__C=C) - logistic_regression.fit(data_train, target_train) - coefs = logistic_regression[-1].coef_[0] - weights_ridge.append(pd.Series(coefs, index=culmen_columns)) - -# %% tags=["solution"] -weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f"C: {C}" for C in Cs]) -weights_ridge.plot.barh() -_ = plt.title("LogisticRegression weights depending of C") - -# %% [markdown] tags=["solution"] -# We see that a small `C` will shrink the weights values toward zero. It means -# that a small `C` provides a more regularized model. Thus, `C` is the inverse -# of the `alpha` coefficient in the `Ridge` model. -# -# Besides, with a strong penalty (i.e. small `C` value), the weight of the -# feature "Culmen Depth (mm)" is almost zero. It explains why the decision -# separation in the plot is almost perpendicular to the "Culmen Length (mm)" -# feature.
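To make the inverse relationship between `C` and the regularization strength concrete, here is a minimal sketch that reuses the `logistic_regression` pipeline and the penguins training split defined in `linear_models_sol_03.py`; it assumes those objects are already in scope and simply reports the coefficient magnitude for each candidate `C`.

```python
import numpy as np

# Smaller C => stronger L2 penalty => coefficients shrink toward zero,
# mirroring what a larger alpha does for Ridge regression.
for C in [0.01, 0.1, 1, 10]:
    logistic_regression.set_params(logisticregression__C=C)
    logistic_regression.fit(data_train, target_train)
    coef_norm = np.linalg.norm(logistic_regression[-1].coef_)
    print(f"C={C:>5}: coefficient norm = {coef_norm:.3f}")
```

The printed norms should grow as `C` increases, matching the bar plot of weights in the solution above.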