diff --git a/jupyter-book/_toc.yml b/jupyter-book/_toc.yml index dfc89c04f..80bb88aa3 100644 --- a/jupyter-book/_toc.yml +++ b/jupyter-book/_toc.yml @@ -91,34 +91,26 @@ parts: sections: - file: linear_models/linear_models_slides - file: linear_models/linear_models_quiz_m4_01 - - file: linear_models/linear_models_regression_index - sections: - file: python_scripts/linear_regression_without_sklearn - file: python_scripts/linear_models_ex_01 - file: python_scripts/linear_models_sol_01 - file: python_scripts/linear_regression_in_sklearn + - file: python_scripts/logistic_regression - file: linear_models/linear_models_quiz_m4_02 - file: linear_models/linear_models_non_linear_index sections: + - file: python_scripts/linear_regression_non_linear_link - file: python_scripts/linear_models_ex_02 - file: python_scripts/linear_models_sol_02 - - file: python_scripts/linear_regression_non_linear_link - - file: python_scripts/linear_models_ex_03 - - file: python_scripts/linear_models_sol_03 + - file: python_scripts/logistic_regression_non_linear - file: linear_models/linear_models_quiz_m4_03 - file: linear_models/linear_models_regularization_index sections: - file: linear_models/regularized_linear_models_slides - file: python_scripts/linear_models_regularization - - file: python_scripts/linear_models_ex_04 - - file: python_scripts/linear_models_sol_04 - file: linear_models/linear_models_quiz_m4_04 - - file: linear_models/linear_models_classification_index - sections: - - file: python_scripts/logistic_regression - - file: python_scripts/linear_models_ex_05 - - file: python_scripts/linear_models_sol_05 - - file: python_scripts/logistic_regression_non_linear + - file: python_scripts/linear_models_ex_03 + - file: python_scripts/linear_models_sol_03 - file: linear_models/linear_models_quiz_m4_05 - file: linear_models/linear_models_wrap_up_quiz - file: linear_models/linear_models_module_take_away diff --git a/jupyter-book/linear_models/linear_models_classification_index.md b/jupyter-book/linear_models/linear_models_classification_index.md deleted file mode 100644 index 81399c436..000000000 --- a/jupyter-book/linear_models/linear_models_classification_index.md +++ /dev/null @@ -1,5 +0,0 @@ -# Linear model for classification - -```{tableofcontents} - -``` diff --git a/jupyter-book/linear_models/linear_models_non_linear_index.md b/jupyter-book/linear_models/linear_models_non_linear_index.md index d56614515..22fe06b20 100644 --- a/jupyter-book/linear_models/linear_models_non_linear_index.md +++ b/jupyter-book/linear_models/linear_models_non_linear_index.md @@ -1,4 +1,4 @@ -# Modelling non-linear features-target relationships +# Non-linear feature engineering for linear models ```{tableofcontents} diff --git a/jupyter-book/linear_models/linear_models_regression_index.md b/jupyter-book/linear_models/linear_models_regression_index.md deleted file mode 100644 index 8b8144a84..000000000 --- a/jupyter-book/linear_models/linear_models_regression_index.md +++ /dev/null @@ -1,5 +0,0 @@ -# Linear regression - -```{tableofcontents} - -``` diff --git a/notebooks/linear_models_ex_02.ipynb b/notebooks/linear_models_ex_02.ipynb index c9c0aad96..4cf750e81 100644 --- a/notebooks/linear_models_ex_02.ipynb +++ b/notebooks/linear_models_ex_02.ipynb @@ -4,39 +4,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# \ud83d\udcdd Exercise M4.02\n", + "# \ud83d\udcdd Exercise M4.03\n", "\n", - "The goal of this exercise is to build an intuition on what will be the\n", - "parameters' values of a linear model when the link between 
the data and the\n", - "target is non-linear.\n", + "In all previous notebooks, we only used a single feature in `data`. But we\n", + "have already shown that we could add new features to make the model more\n", + "expressive by deriving new features, based on the original feature.\n", "\n", - "First, we will generate such non-linear data.\n", + "The aim of this notebook is to train a linear regression algorithm on a\n", + "dataset with more than a single feature.\n", "\n", - "
\n", - "

Tip

\n", - "

np.random.RandomState allows to create a random number generator which can\n", - "be later used to get deterministic results.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "# Set the seed for reproduction\n", - "rng = np.random.RandomState(0)\n", - "\n", - "# Generate data\n", - "n_sample = 100\n", - "data_max, data_min = 1.4, -1.4\n", - "len_data = data_max - data_min\n", - "data = rng.rand(n_sample) * len_data - len_data / 2\n", - "noise = rng.randn(n_sample) * 0.3\n", - "target = data**3 - 0.5 * data**2 + noise" + "We will load a dataset about house prices in California. The dataset consists\n", + "of 8 features regarding the demography and geography of districts in\n", + "California and the aim is to predict the median house price of each district.\n", + "We will use all 8 features to predict the target, the median house price." ] }, { @@ -45,8 +25,8 @@ "source": [ "
\n", "

Note

\n", - "

To ease the plotting, we will create a Pandas dataframe containing the data\n", - "and target

\n", + "

If you want a deeper overview regarding this dataset, you can refer to the\n", + "Appendix - Datasets description section at the end of this MOOC.

\n", "
" ] }, @@ -56,65 +36,19 @@ "metadata": {}, "outputs": [], "source": [ - "import pandas as pd\n", + "from sklearn.datasets import fetch_california_housing\n", "\n", - "full_data = pd.DataFrame({\"data\": data, \"target\": target})" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import seaborn as sns\n", - "\n", - "_ = sns.scatterplot(\n", - " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 2 - }, - "source": [ - "We observe that the link between the data `data` and vector `target` is\n", - "non-linear. For instance, `data` could represent the years of experience\n", - "(normalized) and `target` the salary (normalized). Therefore, the problem here\n", - "would be to infer the salary given the years of experience.\n", - "\n", - "Using the function `f` defined below, find both the `weight` and the\n", - "`intercept` that you think will lead to a good linear model. Plot both the\n", - "data and the predictions of this model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def f(data, weight=0, intercept=0):\n", - " target_predict = weight * data + intercept\n", - " return target_predict" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." + "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n", + "target *= 100 # rescale the target in k$\n", + "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Compute the mean squared error for this model" + "Now it is your turn to train a linear regression model on this dataset. First,\n", + "create a linear regression model." ] }, { @@ -130,16 +64,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Train a linear regression model on this dataset.\n", - "\n", - "
\n", - "

Warning

\n", - "

In scikit-learn, by convention data (also called X in the scikit-learn\n", - "documentation) should be a 2D matrix of shape (n_samples, n_features).\n", - "If data is a 1D vector, you need to reshape it into a matrix with a\n", - "single column if the vector represents a feature or a single row if the\n", - "vector represents a sample.

\n", - "
" + "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n", + "as metric. Be sure to *return* the fitted *estimators*." ] }, { @@ -148,8 +74,6 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn.linear_model import LinearRegression\n", - "\n", "# Write your code here." ] }, @@ -157,8 +81,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Compute predictions from the linear regression model and plot both the data\n", - "and the predictions." + "Compute the mean and std of the MAE in thousands of dollars (k$)." ] }, { @@ -172,9 +95,15 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "lines_to_next_cell": 2 + }, "source": [ - "Compute the mean squared error" + "Inspect the fitted model using a box plot to show the distribution of values\n", + "for the coefficients returned from the cross-validation. Hint: use the\n", + "function\n", + "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n", + "to create a box plot." ] }, { diff --git a/notebooks/linear_models_ex_03.ipynb b/notebooks/linear_models_ex_03.ipynb deleted file mode 100644 index 4cf750e81..000000000 --- a/notebooks/linear_models_ex_03.ipynb +++ /dev/null @@ -1,130 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# \ud83d\udcdd Exercise M4.03\n", - "\n", - "In all previous notebooks, we only used a single feature in `data`. But we\n", - "have already shown that we could add new features to make the model more\n", - "expressive by deriving new features, based on the original feature.\n", - "\n", - "The aim of this notebook is to train a linear regression algorithm on a\n", - "dataset with more than a single feature.\n", - "\n", - "We will load a dataset about house prices in California. The dataset consists\n", - "of 8 features regarding the demography and geography of districts in\n", - "California and the aim is to predict the median house price of each district.\n", - "We will use all 8 features to predict the target, the median house price." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - "

Note

\n", - "

If you want a deeper overview regarding this dataset, you can refer to the\n", - "Appendix - Datasets description section at the end of this MOOC.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.datasets import fetch_california_housing\n", - "\n", - "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n", - "target *= 100 # rescale the target in k$\n", - "data.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now it is your turn to train a linear regression model on this dataset. First,\n", - "create a linear regression model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n", - "as metric. Be sure to *return* the fitted *estimators*." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Compute the mean and std of the MAE in thousands of dollars (k$)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 2 - }, - "source": [ - "Inspect the fitted model using a box plot to show the distribution of values\n", - "for the coefficients returned from the cross-validation. Hint: use the\n", - "function\n", - "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n", - "to create a box plot." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} \ No newline at end of file diff --git a/notebooks/linear_models_ex_04.ipynb b/notebooks/linear_models_ex_04.ipynb deleted file mode 100644 index 77086778b..000000000 --- a/notebooks/linear_models_ex_04.ipynb +++ /dev/null @@ -1,165 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# \ud83d\udcdd Exercise M4.04\n", - "\n", - "In the previous notebook, we saw the effect of applying some regularization on\n", - "the coefficient of a linear model.\n", - "\n", - "In this exercise, we will study the advantage of using some regularization\n", - "when dealing with correlated features.\n", - "\n", - "We will first create a regression dataset. This dataset will contain 2,000\n", - "samples and 5 features from which only 2 features will be informative." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.datasets import make_regression\n", - "\n", - "data, target, coef = make_regression(\n", - " n_samples=2_000,\n", - " n_features=5,\n", - " n_informative=2,\n", - " shuffle=False,\n", - " coef=True,\n", - " random_state=0,\n", - " noise=30,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When creating the dataset, `make_regression` returns the true coefficient used\n", - "to generate the dataset. Let's plot this information." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "\n", - "feature_names = [\n", - " \"Relevant feature #0\",\n", - " \"Relevant feature #1\",\n", - " \"Noisy feature #0\",\n", - " \"Noisy feature #1\",\n", - " \"Noisy feature #2\",\n", - "]\n", - "coef = pd.Series(coef, index=feature_names)\n", - "coef.plot.barh()\n", - "coef" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a `LinearRegression` regressor and fit on the entire dataset and check\n", - "the value of the coefficients. Are the coefficients of the linear regressor\n", - "close to the coefficients used to generate the dataset?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, create a new dataset that will be the same as `data` with 4 additional\n", - "columns that will repeat twice features 0 and 1. This procedure will create\n", - "perfectly correlated features." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Fit again the linear regressor on this new dataset and check the coefficients.\n", - "What do you observe?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a ridge regressor and fit on the same dataset. Check the coefficients.\n", - "What do you observe?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Can you find the relationship between the ridge coefficients and the original\n", - "coefficients?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} \ No newline at end of file diff --git a/notebooks/linear_models_ex_05.ipynb b/notebooks/linear_models_ex_05.ipynb deleted file mode 100644 index 866d52086..000000000 --- a/notebooks/linear_models_ex_05.ipynb +++ /dev/null @@ -1,137 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# \ud83d\udcdd Exercise M4.05\n", - "\n", - "In the previous notebook we set `penalty=\"none\"` to disable regularization\n", - "entirely. This parameter can also control the **type** of regularization to\n", - "use, whereas the regularization **strength** is set using the parameter `C`.\n", - "Setting`penalty=\"none\"` is equivalent to an infinitely large value of `C`. In\n", - "this exercise, we ask you to train a logistic regression classifier using the\n", - "`penalty=\"l2\"` regularization (which happens to be the default in\n", - "scikit-learn) to find by yourself the effect of the parameter `C`.\n", - "\n", - "We will start by loading the dataset." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - "

Note

\n", - "

If you want a deeper overview regarding this dataset, you can refer to the\n", - "Appendix - Datasets description section at the end of this MOOC.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "\n", - "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n", - "# only keep the Adelie and Chinstrap classes\n", - "penguins = (\n", - " penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n", - ")\n", - "\n", - "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n", - "target_column = \"Species\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.model_selection import train_test_split\n", - "\n", - "penguins_train, penguins_test = train_test_split(penguins, random_state=0)\n", - "\n", - "data_train = penguins_train[culmen_columns]\n", - "data_test = penguins_test[culmen_columns]\n", - "\n", - "target_train = penguins_train[target_column]\n", - "target_test = penguins_test[target_column]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First, let's create our predictive model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.pipeline import make_pipeline\n", - "from sklearn.preprocessing import StandardScaler\n", - "from sklearn.linear_model import LogisticRegression\n", - "\n", - "logistic_regression = make_pipeline(\n", - " StandardScaler(), LogisticRegression(penalty=\"l2\")\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Given the following candidates for the `C` parameter, find out the impact of\n", - "`C` on the classifier decision boundary. You can use\n", - "`sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the\n", - "decision function boundary." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "Cs = [0.01, 0.1, 1, 10]\n", - "\n", - "# Write your code here." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Look at the impact of the `C` hyperparameter on the magnitude of the weights." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Write your code here." - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} \ No newline at end of file diff --git a/notebooks/linear_models_sol_02.ipynb b/notebooks/linear_models_sol_02.ipynb index d56864c4e..634c43171 100644 --- a/notebooks/linear_models_sol_02.ipynb +++ b/notebooks/linear_models_sol_02.ipynb @@ -4,39 +4,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# \ud83d\udcc3 Solution for Exercise M4.02\n", + "# \ud83d\udcc3 Solution for Exercise M4.03\n", "\n", - "The goal of this exercise is to build an intuition on what will be the\n", - "parameters' values of a linear model when the link between the data and the\n", - "target is non-linear.\n", + "In all previous notebooks, we only used a single feature in `data`. But we\n", + "have already shown that we could add new features to make the model more\n", + "expressive by deriving new features, based on the original feature.\n", "\n", - "First, we will generate such non-linear data.\n", + "The aim of this notebook is to train a linear regression algorithm on a\n", + "dataset with more than a single feature.\n", "\n", - "
\n", - "

Tip

\n", - "

np.random.RandomState allows to create a random number generator which can\n", - "be later used to get deterministic results.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "# Set the seed for reproduction\n", - "rng = np.random.RandomState(0)\n", - "\n", - "# Generate data\n", - "n_sample = 100\n", - "data_max, data_min = 1.4, -1.4\n", - "len_data = data_max - data_min\n", - "data = rng.rand(n_sample) * len_data - len_data / 2\n", - "noise = rng.randn(n_sample) * 0.3\n", - "target = data**3 - 0.5 * data**2 + noise" + "We will load a dataset about house prices in California. The dataset consists\n", + "of 8 features regarding the demography and geography of districts in\n", + "California and the aim is to predict the median house price of each district.\n", + "We will use all 8 features to predict the target, the median house price." ] }, { @@ -45,8 +25,8 @@ "source": [ "
\n", "

Note

\n", - "

To ease the plotting, we will create a Pandas dataframe containing the data\n", - "and target

\n", + "

If you want a deeper overview regarding this dataset, you can refer to the\n", + "Appendix - Datasets description section at the end of this MOOC.

\n", "
" ] }, @@ -56,49 +36,19 @@ "metadata": {}, "outputs": [], "source": [ - "import pandas as pd\n", - "\n", - "full_data = pd.DataFrame({\"data\": data, \"target\": target})" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import seaborn as sns\n", + "from sklearn.datasets import fetch_california_housing\n", "\n", - "_ = sns.scatterplot(\n", - " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n", - ")" + "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n", + "target *= 100 # rescale the target in k$\n", + "data.head()" ] }, { "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 2 - }, - "source": [ - "We observe that the link between the data `data` and vector `target` is\n", - "non-linear. For instance, `data` could represent the years of experience\n", - "(normalized) and `target` the salary (normalized). Therefore, the problem here\n", - "would be to infer the salary given the years of experience.\n", - "\n", - "Using the function `f` defined below, find both the `weight` and the\n", - "`intercept` that you think will lead to a good linear model. Plot both the\n", - "data and the predictions of this model." - ] - }, - { - "cell_type": "code", - "execution_count": null, "metadata": {}, - "outputs": [], "source": [ - "def f(data, weight=0, intercept=0):\n", - " target_predict = weight * data + intercept\n", - " return target_predict" + "Now it is your turn to train a linear regression model on this dataset. First,\n", + "create a linear regression model." ] }, { @@ -108,30 +58,17 @@ "outputs": [], "source": [ "# solution\n", - "predictions = f(data, weight=1.2, intercept=-0.2)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "solution" - ] - }, - "outputs": [], - "source": [ - "ax = sns.scatterplot(\n", - " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n", - ")\n", - "_ = ax.plot(data, predictions)" + "from sklearn.linear_model import LinearRegression\n", + "\n", + "linear_regression = LinearRegression()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Compute the mean squared error for this model" + "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n", + "as metric. Be sure to *return* the fitted *estimators*." ] }, { @@ -141,26 +78,24 @@ "outputs": [], "source": [ "# solution\n", - "from sklearn.metrics import mean_squared_error\n", + "from sklearn.model_selection import cross_validate\n", "\n", - "error = mean_squared_error(target, f(data, weight=1.2, intercept=-0.2))\n", - "print(f\"The MSE is {error}\")" + "cv_results = cross_validate(\n", + " linear_regression,\n", + " data,\n", + " target,\n", + " scoring=\"neg_mean_absolute_error\",\n", + " return_estimator=True,\n", + " cv=10,\n", + " n_jobs=2,\n", + ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Train a linear regression model on this dataset.\n", - "\n", - "
\n", - "

Warning

\n", - "

In scikit-learn, by convention data (also called X in the scikit-learn\n", - "documentation) should be a 2D matrix of shape (n_samples, n_features).\n", - "If data is a 1D vector, you need to reshape it into a matrix with a\n", - "single column if the vector represents a feature or a single row if the\n", - "vector represents a sample.

\n", - "
" + "Compute the mean and std of the MAE in thousands of dollars (k$)." ] }, { @@ -169,20 +104,25 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn.linear_model import LinearRegression\n", - "\n", "# solution\n", - "linear_regression = LinearRegression()\n", - "data_2d = data.reshape(-1, 1)\n", - "linear_regression.fit(data_2d, target)" + "print(\n", + " \"Mean absolute error on testing set: \"\n", + " f\"{-cv_results['test_score'].mean():.3f} k$ \u00b1 \"\n", + " f\"{cv_results['test_score'].std():.3f}\"\n", + ")" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "lines_to_next_cell": 2 + }, "source": [ - "Compute predictions from the linear regression model and plot both the data\n", - "and the predictions." + "Inspect the fitted model using a box plot to show the distribution of values\n", + "for the coefficients returned from the cross-validation. Hint: use the\n", + "function\n", + "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n", + "to create a box plot." ] }, { @@ -192,7 +132,11 @@ "outputs": [], "source": [ "# solution\n", - "predictions = linear_regression.predict(data_2d)" + "import pandas as pd\n", + "\n", + "weights = pd.DataFrame(\n", + " [est.coef_ for est in cv_results[\"estimator\"]], columns=data.columns\n", + ")" ] }, { @@ -205,28 +149,11 @@ }, "outputs": [], "source": [ - "ax = sns.scatterplot(\n", - " data=full_data, x=\"data\", y=\"target\", color=\"black\", alpha=0.5\n", - ")\n", - "_ = ax.plot(data, predictions)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Compute the mean squared error" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "error = mean_squared_error(target, predictions)\n", - "print(f\"The MSE is {error}\")" + "import matplotlib.pyplot as plt\n", + "\n", + "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n", + "weights.plot.box(color=color, vert=False)\n", + "_ = plt.title(\"Value of linear regression coefficients\")" ] } ], diff --git a/notebooks/linear_models_sol_03.ipynb b/notebooks/linear_models_sol_03.ipynb deleted file mode 100644 index 634c43171..000000000 --- a/notebooks/linear_models_sol_03.ipynb +++ /dev/null @@ -1,171 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# \ud83d\udcc3 Solution for Exercise M4.03\n", - "\n", - "In all previous notebooks, we only used a single feature in `data`. But we\n", - "have already shown that we could add new features to make the model more\n", - "expressive by deriving new features, based on the original feature.\n", - "\n", - "The aim of this notebook is to train a linear regression algorithm on a\n", - "dataset with more than a single feature.\n", - "\n", - "We will load a dataset about house prices in California. The dataset consists\n", - "of 8 features regarding the demography and geography of districts in\n", - "California and the aim is to predict the median house price of each district.\n", - "We will use all 8 features to predict the target, the median house price." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - "

Note

\n", - "

If you want a deeper overview regarding this dataset, you can refer to the\n", - "Appendix - Datasets description section at the end of this MOOC.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.datasets import fetch_california_housing\n", - "\n", - "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n", - "target *= 100 # rescale the target in k$\n", - "data.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now it is your turn to train a linear regression model on this dataset. First,\n", - "create a linear regression model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "from sklearn.linear_model import LinearRegression\n", - "\n", - "linear_regression = LinearRegression()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n", - "as metric. Be sure to *return* the fitted *estimators*." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "from sklearn.model_selection import cross_validate\n", - "\n", - "cv_results = cross_validate(\n", - " linear_regression,\n", - " data,\n", - " target,\n", - " scoring=\"neg_mean_absolute_error\",\n", - " return_estimator=True,\n", - " cv=10,\n", - " n_jobs=2,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Compute the mean and std of the MAE in thousands of dollars (k$)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "print(\n", - " \"Mean absolute error on testing set: \"\n", - " f\"{-cv_results['test_score'].mean():.3f} k$ \u00b1 \"\n", - " f\"{cv_results['test_score'].std():.3f}\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 2 - }, - "source": [ - "Inspect the fitted model using a box plot to show the distribution of values\n", - "for the coefficients returned from the cross-validation. Hint: use the\n", - "function\n", - "[`df.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html)\n", - "to create a box plot." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "import pandas as pd\n", - "\n", - "weights = pd.DataFrame(\n", - " [est.coef_ for est in cv_results[\"estimator\"]], columns=data.columns\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "solution" - ] - }, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "\n", - "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n", - "weights.plot.box(color=color, vert=False)\n", - "_ = plt.title(\"Value of linear regression coefficients\")" - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} \ No newline at end of file diff --git a/notebooks/linear_models_sol_04.ipynb b/notebooks/linear_models_sol_04.ipynb deleted file mode 100644 index f49b0c465..000000000 --- a/notebooks/linear_models_sol_04.ipynb +++ /dev/null @@ -1,492 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# \ud83d\udcc3 Solution for Exercise M4.04\n", - "\n", - "In the previous notebook, we saw the effect of applying some regularization on\n", - "the coefficient of a linear model.\n", - "\n", - "In this exercise, we will study the advantage of using some regularization\n", - "when dealing with correlated features.\n", - "\n", - "We will first create a regression dataset. This dataset will contain 2,000\n", - "samples and 5 features from which only 2 features will be informative." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.datasets import make_regression\n", - "\n", - "data, target, coef = make_regression(\n", - " n_samples=2_000,\n", - " n_features=5,\n", - " n_informative=2,\n", - " shuffle=False,\n", - " coef=True,\n", - " random_state=0,\n", - " noise=30,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When creating the dataset, `make_regression` returns the true coefficient used\n", - "to generate the dataset. Let's plot this information." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "\n", - "feature_names = [\n", - " \"Relevant feature #0\",\n", - " \"Relevant feature #1\",\n", - " \"Noisy feature #0\",\n", - " \"Noisy feature #1\",\n", - " \"Noisy feature #2\",\n", - "]\n", - "coef = pd.Series(coef, index=feature_names)\n", - "coef.plot.barh()\n", - "coef" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a `LinearRegression` regressor and fit on the entire dataset and check\n", - "the value of the coefficients. Are the coefficients of the linear regressor\n", - "close to the coefficients used to generate the dataset?" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "from sklearn.linear_model import LinearRegression\n", - "\n", - "linear_regression = LinearRegression()\n", - "linear_regression.fit(data, target)\n", - "linear_regression.coef_" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "solution" - ] - }, - "outputs": [], - "source": [ - "feature_names = [\n", - " \"Relevant feature #0\",\n", - " \"Relevant feature #1\",\n", - " \"Noisy feature #0\",\n", - " \"Noisy feature #1\",\n", - " \"Noisy feature #2\",\n", - "]\n", - "coef = pd.Series(linear_regression.coef_, index=feature_names)\n", - "_ = coef.plot.barh()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [ - "solution" - ] - }, - "source": [ - "We see that the coefficients are close to the coefficients used to generate\n", - "the dataset. The dispersion is indeed cause by the noise injected during the\n", - "dataset generation." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, create a new dataset that will be the same as `data` with 4 additional\n", - "columns that will repeat twice features 0 and 1. This procedure will create\n", - "perfectly correlated features." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "import numpy as np\n", - "\n", - "data = np.concatenate([data, data[:, [0, 1]], data[:, [0, 1]]], axis=1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Fit again the linear regressor on this new dataset and check the coefficients.\n", - "What do you observe?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "linear_regression = LinearRegression()\n", - "linear_regression.fit(data, target)\n", - "linear_regression.coef_" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "solution" - ] - }, - "outputs": [], - "source": [ - "feature_names = [\n", - " \"Relevant feature #0\",\n", - " \"Relevant feature #1\",\n", - " \"Noisy feature #0\",\n", - " \"Noisy feature #1\",\n", - " \"Noisy feature #2\",\n", - " \"First repetition of feature #0\",\n", - " \"First repetition of feature #1\",\n", - " \"Second repetition of feature #0\",\n", - " \"Second repetition of feature #1\",\n", - "]\n", - "coef = pd.Series(linear_regression.coef_, index=feature_names)\n", - "_ = coef.plot.barh()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [ - "solution" - ] - }, - "source": [ - "We see that the coefficient values are far from what one could expect. By\n", - "repeating the informative features, one would have expected these coefficients\n", - "to be similarly informative.\n", - "\n", - "Instead, we see that some coefficients have a huge norm ~1e14. It indeed means\n", - "that we try to solve an mathematical ill-posed problem. Indeed, finding\n", - "coefficients in a linear regression involves inverting the matrix\n", - "`np.dot(data.T, data)` which is not possible (or lead to high numerical\n", - "errors)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a ridge regressor and fit on the same dataset. Check the coefficients.\n", - "What do you observe?" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "from sklearn.linear_model import Ridge\n", - "\n", - "ridge = Ridge()\n", - "ridge.fit(data, target)\n", - "ridge.coef_" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "solution" - ] - }, - "outputs": [], - "source": [ - "coef = pd.Series(ridge.coef_, index=feature_names)\n", - "_ = coef.plot.barh()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [ - "solution" - ] - }, - "source": [ - "We see that the penalty applied on the weights give a better results: the\n", - "values of the coefficients do not suffer from numerical issues. Indeed, the\n", - "matrix to be inverted internally is `np.dot(data.T, data) + alpha * I`. Adding\n", - "this penalty `alpha` allow the inversion without numerical issue." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Can you find the relationship between the ridge coefficients and the original\n", - "coefficients?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "ridge.coef_[:5] * 3" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [ - "solution" - ] - }, - "source": [ - "Repeating three times each informative features induced to divide the ridge\n", - "coefficients by three." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [ - "solution" - ] - }, - "source": [ - "
\n", - "

Tip

\n", - "

We advise to always use a penalty to shrink the magnitude of the weights\n", - "toward zero (also called \"l2 penalty\"). In scikit-learn, LogisticRegression\n", - "applies such penalty by default. However, one needs to use Ridge (and even\n", - "RidgeCV to tune the parameter alpha) instead of LinearRegression.

\n", - "

Other kinds of regularizations exist but will not be covered in this course.

\n", - "
\n", - "\n", - "## Dealing with correlation between one-hot encoded features\n", - "\n", - "In this section, we will focus on how to deal with correlated features that\n", - "arise naturally when one-hot encoding categorical features.\n", - "\n", - "Let's first load the Ames housing dataset and take a subset of features that\n", - "are only categorical features." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "solution" - ] - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "from sklearn.model_selection import train_test_split\n", - "\n", - "ames_housing = pd.read_csv(\"../datasets/house_prices.csv\", na_values=\"?\")\n", - "ames_housing = ames_housing.drop(columns=\"Id\")\n", - "\n", - "categorical_columns = [\"Street\", \"Foundation\", \"CentralAir\", \"PavedDrive\"]\n", - "target_name = \"SalePrice\"\n", - "X, y = ames_housing[categorical_columns], ames_housing[target_name]\n", - "\n", - "X_train, X_test, y_train, y_test = train_test_split(\n", - " X, y, test_size=0.2, random_state=0\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [ - "solution" - ] - }, - "source": [ - "\n", - "We previously presented that a `OneHotEncoder` creates as many columns as\n", - "categories. Therefore, there is always one column (i.e. one encoded category)\n", - "that can be inferred from the others. Thus, `OneHotEncoder` creates collinear\n", - "features.\n", - "\n", - "We illustrate this behaviour by considering the \"CentralAir\" feature that\n", - "contains only two categories:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "solution" - ] - }, - "outputs": [], - "source": [ - "X_train[\"CentralAir\"]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "solution" - ] - }, - "outputs": [], - "source": [ - "from sklearn.preprocessing import OneHotEncoder\n", - "\n", - "single_feature = [\"CentralAir\"]\n", - "encoder = OneHotEncoder(sparse_output=False, dtype=np.int32)\n", - "X_trans = encoder.fit_transform(X_train[single_feature])\n", - "X_trans = pd.DataFrame(\n", - " X_trans,\n", - " columns=encoder.get_feature_names_out(input_features=single_feature),\n", - ")\n", - "X_trans" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [ - "solution" - ] - }, - "source": [ - "\n", - "Here, we see that the encoded category \"CentralAir_N\" is the opposite of the\n", - "encoded category \"CentralAir_Y\". Therefore, we observe that using a\n", - "`OneHotEncoder` creates two features having the problematic pattern observed\n", - "earlier in this exercise. Training a linear regression model on such a of\n", - "one-hot encoded binary feature can therefore lead to numerical problems,\n", - "especially without regularization. Furthermore, the two one-hot features are\n", - "redundant as they encode exactly the same information in opposite ways.\n", - "\n", - "Using regularization helps to overcome the numerical issues that we\n", - "highlighted earlier in this exercise.\n", - "\n", - "Another strategy is to arbitrarily drop one of the encoded categories.\n", - "Scikit-learn provides such an option by setting the parameter `drop` in the\n", - "`OneHotEncoder`. This parameter can be set to `first` to always drop the first\n", - "encoded category or `binary_only` to only drop a column in the case of binary\n", - "categories." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "solution" - ] - }, - "outputs": [], - "source": [ - "encoder = OneHotEncoder(drop=\"first\", sparse_output=False, dtype=np.int32)\n", - "X_trans = encoder.fit_transform(X_train[single_feature])\n", - "X_trans = pd.DataFrame(\n", - " X_trans,\n", - " columns=encoder.get_feature_names_out(input_features=single_feature),\n", - ")\n", - "X_trans" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [ - "solution" - ] - }, - "source": [ - "\n", - "We see that only the second column of the previous encoded data is kept.\n", - "Dropping one of the one-hot encoded column is a common practice, especially\n", - "for binary categorical features. Note however that this breaks symmetry\n", - "between categories and impacts the number of coefficients of the model, their\n", - "values, and thus their meaning, especially when applying strong\n", - "regularization.\n", - "\n", - "Let's finally illustrate how to use this option is a machine-learning\n", - "pipeline:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "solution" - ] - }, - "outputs": [], - "source": [ - "from sklearn.pipeline import make_pipeline\n", - "\n", - "model = make_pipeline(OneHotEncoder(drop=\"first\", dtype=np.int32), Ridge())\n", - "model.fit(X_train, y_train)\n", - "n_categories = [X_train[col].nunique() for col in X_train.columns]\n", - "print(f\"R2 score on the testing set: {model.score(X_test, y_test):.2f}\")\n", - "print(\n", - " f\"Our model contains {model[-1].coef_.size} features while \"\n", - " f\"{sum(n_categories)} categories are originally available.\"\n", - ")" - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} \ No newline at end of file diff --git a/notebooks/linear_models_sol_05.ipynb b/notebooks/linear_models_sol_05.ipynb deleted file mode 100644 index 08bae2e77..000000000 --- a/notebooks/linear_models_sol_05.ipynb +++ /dev/null @@ -1,201 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# \ud83d\udcc3 Solution for Exercise M4.05\n", - "\n", - "In the previous notebook we set `penalty=\"none\"` to disable regularization\n", - "entirely. This parameter can also control the **type** of regularization to\n", - "use, whereas the regularization **strength** is set using the parameter `C`.\n", - "Setting`penalty=\"none\"` is equivalent to an infinitely large value of `C`. In\n", - "this exercise, we ask you to train a logistic regression classifier using the\n", - "`penalty=\"l2\"` regularization (which happens to be the default in\n", - "scikit-learn) to find by yourself the effect of the parameter `C`.\n", - "\n", - "We will start by loading the dataset." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - "

Note

\n", - "

If you want a deeper overview regarding this dataset, you can refer to the\n", - "Appendix - Datasets description section at the end of this MOOC.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "\n", - "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n", - "# only keep the Adelie and Chinstrap classes\n", - "penguins = (\n", - " penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n", - ")\n", - "\n", - "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n", - "target_column = \"Species\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.model_selection import train_test_split\n", - "\n", - "penguins_train, penguins_test = train_test_split(penguins, random_state=0)\n", - "\n", - "data_train = penguins_train[culmen_columns]\n", - "data_test = penguins_test[culmen_columns]\n", - "\n", - "target_train = penguins_train[target_column]\n", - "target_test = penguins_test[target_column]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First, let's create our predictive model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.pipeline import make_pipeline\n", - "from sklearn.preprocessing import StandardScaler\n", - "from sklearn.linear_model import LogisticRegression\n", - "\n", - "logistic_regression = make_pipeline(\n", - " StandardScaler(), LogisticRegression(penalty=\"l2\")\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Given the following candidates for the `C` parameter, find out the impact of\n", - "`C` on the classifier decision boundary. You can use\n", - "`sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the\n", - "decision function boundary." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "Cs = [0.01, 0.1, 1, 10]\n", - "\n", - "# solution\n", - "import matplotlib.pyplot as plt\n", - "import seaborn as sns\n", - "from sklearn.inspection import DecisionBoundaryDisplay\n", - "\n", - "for C in Cs:\n", - " logistic_regression.set_params(logisticregression__C=C)\n", - " logistic_regression.fit(data_train, target_train)\n", - " accuracy = logistic_regression.score(data_test, target_test)\n", - "\n", - " DecisionBoundaryDisplay.from_estimator(\n", - " logistic_regression,\n", - " data_test,\n", - " response_method=\"predict\",\n", - " cmap=\"RdBu_r\",\n", - " alpha=0.5,\n", - " )\n", - " sns.scatterplot(\n", - " data=penguins_test,\n", - " x=culmen_columns[0],\n", - " y=culmen_columns[1],\n", - " hue=target_column,\n", - " palette=[\"tab:red\", \"tab:blue\"],\n", - " )\n", - " plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n", - " plt.title(f\"C: {C} \\n Accuracy on the test set: {accuracy:.2f}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Look at the impact of the `C` hyperparameter on the magnitude of the weights." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# solution\n", - "weights_ridge = []\n", - "for C in Cs:\n", - " logistic_regression.set_params(logisticregression__C=C)\n", - " logistic_regression.fit(data_train, target_train)\n", - " coefs = logistic_regression[-1].coef_[0]\n", - " weights_ridge.append(pd.Series(coefs, index=culmen_columns))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "solution" - ] - }, - "outputs": [], - "source": [ - "weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f\"C: {C}\" for C in Cs])\n", - "weights_ridge.plot.barh()\n", - "_ = plt.title(\"LogisticRegression weights depending of C\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [ - "solution" - ] - }, - "source": [ - "We see that a small `C` will shrink the weights values toward zero. It means\n", - "that a small `C` provides a more regularized model. Thus, `C` is the inverse\n", - "of the `alpha` coefficient in the `Ridge` model.\n", - "\n", - "Besides, with a strong penalty (i.e. small `C` value), the weight of the\n", - "feature \"Culmen Depth (mm)\" is almost zero. It explains why the decision\n", - "separation in the plot is almost perpendicular to the \"Culmen Length (mm)\"\n", - "feature." - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} \ No newline at end of file diff --git a/python_scripts/linear_models_ex_02.py b/python_scripts/linear_models_ex_02.py index 640c44046..f58a1f0fe 100644 --- a/python_scripts/linear_models_ex_02.py +++ b/python_scripts/linear_models_ex_02.py @@ -14,100 +14,80 @@ # %% [markdown] # # 📝 Exercise M4.02 # -# The goal of this exercise is to build an intuition on what will be the -# parameters' values of a linear model when the link between the data and the -# target is non-linear. +# In the previous notebook, we showed that we can add new features based on the +# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`. +# In that case we only used a single feature in `data`. # -# First, we will generate such non-linear data. +# The aim of this notebook is to train a linear regression algorithm on a +# dataset with more than a single feature. In such a "multi-dimensional" feature +# space we can derive new features of the form `x1 * x2`, `x2 * x3`, +# etc. Products of features are usually called "non-linear or +# multiplicative interactions" between features. # -# ```{tip} -# `np.random.RandomState` allows to create a random number generator which can -# be later used to get deterministic results. -# ``` - -# %% -import numpy as np - -# Set the seed for reproduction -rng = np.random.RandomState(0) - -# Generate data -n_sample = 100 -data_max, data_min = 1.4, -1.4 -len_data = data_max - data_min -data = rng.rand(n_sample) * len_data - len_data / 2 -noise = rng.randn(n_sample) * 0.3 -target = data**3 - 0.5 * data**2 + noise +# Feature engineering can be an important step of a model pipeline as long as +# the new features are expected to be predictive. For instance, think of a +# classification model to decide if a patient has risk of developing a heart +# disease. This would depend on the patient's Body Mass Index which is defined +# as `weight / height ** 2`. +# +# We load the dataset penguins dataset. 
We first use a set of 3 numerical +# features to predict the target, i.e. the body mass of the penguin. # %% [markdown] # ```{note} -# To ease the plotting, we will create a Pandas dataframe containing the data -# and target +# If you want a deeper overview regarding this dataset, you can refer to the +# Appendix - Datasets description section at the end of this MOOC. # ``` # %% import pandas as pd -full_data = pd.DataFrame({"data": data, "target": target}) +penguins = pd.read_csv("../datasets/penguins.csv") -# %% -import seaborn as sns +columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"] +target_name = "Body Mass (g)" -_ = sns.scatterplot( - data=full_data, x="data", y="target", color="black", alpha=0.5 -) +# Remove lines with missing values for the columns of interest +penguins_non_missing = penguins[columns + [target_name]].dropna() -# %% [markdown] -# We observe that the link between the data `data` and vector `target` is -# non-linear. For instance, `data` could represent the years of experience -# (normalized) and `target` the salary (normalized). Therefore, the problem here -# would be to infer the salary given the years of experience. -# -# Using the function `f` defined below, find both the `weight` and the -# `intercept` that you think will lead to a good linear model. Plot both the -# data and the predictions of this model. - - -# %% -def f(data, weight=0, intercept=0): - target_predict = weight * data + intercept - return target_predict +data = penguins_non_missing[columns] +target = penguins_non_missing[target_name] +data.head() +# %% [markdown] +# Now it is your turn to train a linear regression model on this dataset. First, +# create a linear regression model. # %% # Write your code here. # %% [markdown] -# Compute the mean squared error for this model +# Execute a cross-validation with 10 folds and use the mean absolute error (MAE) +# as metric. # %% # Write your code here. # %% [markdown] -# Train a linear regression model on this dataset. -# -# ```{warning} -# In scikit-learn, by convention `data` (also called `X` in the scikit-learn -# documentation) should be a 2D matrix of shape `(n_samples, n_features)`. -# If `data` is a 1D vector, you need to reshape it into a matrix with a -# single column if the vector represents a feature or a single row if the -# vector represents a sample. -# ``` +# Compute the mean and std of the MAE in grams (g). # %% -from sklearn.linear_model import LinearRegression - # Write your code here. # %% [markdown] -# Compute predictions from the linear regression model and plot both the data -# and the predictions. +# Now create a pipeline using `make_pipeline` consisting of a +# `PolynomialFeatures` and a linear regression. Set `degree=2` and +# `interaction_only=True` to the feature engineering step. Remember not to +# include the bias to avoid redundancies with the linear's regression intercept. +# +# Use the same strategy as before to cross-validate such a pipeline. # %% # Write your code here. # %% [markdown] -# Compute the mean squared error +# Compute the mean and std of the MAE in grams (g) and compare with the results +# without feature engineering. # %% # Write your code here. 
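For reference, a minimal sketch (outside the patch) of the pipeline that the updated linear_models_ex_02.py asks for. The dataset path, column names and the PolynomialFeatures settings are the ones named in the exercise above; treat it as one possible solution under those assumptions, not the canonical one.

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Load the same columns as the exercise and drop rows with missing values.
penguins = pd.read_csv("../datasets/penguins.csv")
columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Body Mass (g)"
penguins_non_missing = penguins[columns + [target_name]].dropna()
data, target = penguins_non_missing[columns], penguins_non_missing[target_name]

# degree=2 with interaction_only=True only adds the x1 * x2 products;
# include_bias=False avoids a constant column redundant with the intercept.
model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)
cv_results = cross_validate(
    model, data, target, cv=10, scoring="neg_mean_absolute_error"
)
print(
    "MAE with interactions: "
    f"{-cv_results['test_score'].mean():.1f} +/- "
    f"{cv_results['test_score'].std():.1f} g"
)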
diff --git a/python_scripts/linear_models_ex_03.py b/python_scripts/linear_models_ex_03.py index 3ab6949a3..9c311e817 100644 --- a/python_scripts/linear_models_ex_03.py +++ b/python_scripts/linear_models_ex_03.py @@ -14,24 +14,14 @@ # %% [markdown] # # 📝 Exercise M4.03 # -# In the previous notebook, we showed that we can add new features based on the -# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`. -# In that case we only used a single feature in `data`. +# The parameter `penalty` can control the **type** of regularization to use, +# whereas the regularization **strength** is set using the parameter `C`. +# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In +# this exercise, we ask you to train a logistic regression classifier using the +# `penalty="l2"` regularization (which happens to be the default in +# scikit-learn) to find by yourself the effect of the parameter `C`. # -# The aim of this notebook is to train a linear regression algorithm on a -# dataset with more than a single feature. In such a "multi-dimensional" feature -# space we can derive new features of the form `x1 * x2`, `x2 * x3`, -# etc. Products of features are usually called "non-linear or -# multiplicative interactions" between features. -# -# Feature engineering can be an important step of a model pipeline as long as -# the new features are expected to be predictive. For instance, think of a -# classification model to decide if a patient has risk of developing a heart -# disease. This would depend on the patient's Body Mass Index which is defined -# as `weight / height ** 2`. -# -# We load the dataset penguins dataset. We first use a set of 3 numerical -# features to predict the target, i.e. the body mass of the penguin. +# We start by loading the dataset. # %% [markdown] # ```{note} @@ -42,52 +32,51 @@ # %% import pandas as pd -penguins = pd.read_csv("../datasets/penguins.csv") +penguins = pd.read_csv("../datasets/penguins_classification.csv") +# only keep the Adelie and Chinstrap classes +penguins = ( + penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index() +) -columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"] -target_name = "Body Mass (g)" +culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"] +target_column = "Species" -# Remove lines with missing values for the columns of interest -penguins_non_missing = penguins[columns + [target_name]].dropna() +# %% +from sklearn.model_selection import train_test_split -data = penguins_non_missing[columns] -target = penguins_non_missing[target_name] -data.head() +penguins_train, penguins_test = train_test_split(penguins, random_state=0) -# %% [markdown] -# Now it is your turn to train a linear regression model on this dataset. First, -# create a linear regression model. +data_train = penguins_train[culmen_columns] +data_test = penguins_test[culmen_columns] -# %% -# Write your code here. +target_train = penguins_train[target_column] +target_test = penguins_test[target_column] # %% [markdown] -# Execute a cross-validation with 10 folds and use the mean absolute error (MAE) -# as metric. +# First, let's create our predictive model. # %% -# Write your code here. +from sklearn.pipeline import make_pipeline +from sklearn.preprocessing import StandardScaler +from sklearn.linear_model import LogisticRegression -# %% [markdown] -# Compute the mean and std of the MAE in grams (g). - -# %% -# Write your code here. 
+logistic_regression = make_pipeline( + StandardScaler(), LogisticRegression(penalty="l2") +) # %% [markdown] -# Now create a pipeline using `make_pipeline` consisting of a -# `PolynomialFeatures` and a linear regression. Set `degree=2` and -# `interaction_only=True` to the feature engineering step. Remember not to -# include the bias to avoid redundancies with the linear's regression intercept. -# -# Use the same strategy as before to cross-validate such a pipeline. +# Given the following candidates for the `C` parameter, find out the impact of +# `C` on the classifier decision boundary. You can use +# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the +# decision function boundary. # %% +Cs = [0.01, 0.1, 1, 10] + # Write your code here. # %% [markdown] -# Compute the mean and std of the MAE in grams (g) and compare with the results -# without feature engineering. +# Look at the impact of the `C` hyperparameter on the magnitude of the weights. # %% # Write your code here. diff --git a/python_scripts/linear_models_ex_04.py b/python_scripts/linear_models_ex_04.py deleted file mode 100644 index 18191bccf..000000000 --- a/python_scripts/linear_models_ex_04.py +++ /dev/null @@ -1,92 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.14.5 -# kernelspec: -# display_name: Python 3 -# name: python3 -# --- - -# %% [markdown] -# # 📝 Exercise M4.04 -# -# In the previous notebook, we saw the effect of applying some regularization on -# the coefficient of a linear model. -# -# In this exercise, we will study the advantage of using some regularization -# when dealing with correlated features. -# -# We will first create a regression dataset. This dataset will contain 2,000 -# samples and 5 features from which only 2 features will be informative. - -# %% -from sklearn.datasets import make_regression - -data, target, coef = make_regression( - n_samples=2_000, - n_features=5, - n_informative=2, - shuffle=False, - coef=True, - random_state=0, - noise=30, -) - -# %% [markdown] -# When creating the dataset, `make_regression` returns the true coefficient used -# to generate the dataset. Let's plot this information. - -# %% -import pandas as pd - -feature_names = [ - "Relevant feature #0", - "Relevant feature #1", - "Noisy feature #0", - "Noisy feature #1", - "Noisy feature #2", -] -coef = pd.Series(coef, index=feature_names) -coef.plot.barh() -coef - -# %% [markdown] -# Create a `LinearRegression` regressor and fit on the entire dataset and check -# the value of the coefficients. Are the coefficients of the linear regressor -# close to the coefficients used to generate the dataset? - -# %% -# Write your code here. - -# %% [markdown] -# Now, create a new dataset that will be the same as `data` with 4 additional -# columns that will repeat twice features 0 and 1. This procedure will create -# perfectly correlated features. - -# %% -# Write your code here. - -# %% [markdown] -# Fit again the linear regressor on this new dataset and check the coefficients. -# What do you observe? - -# %% -# Write your code here. - -# %% [markdown] -# Create a ridge regressor and fit on the same dataset. Check the coefficients. -# What do you observe? - -# %% -# Write your code here. - -# %% [markdown] -# Can you find the relationship between the ridge coefficients and the original -# coefficients? - -# %% -# Write your code here. 
diff --git a/python_scripts/linear_models_ex_05.py b/python_scripts/linear_models_ex_05.py deleted file mode 100644 index 1c36b83c2..000000000 --- a/python_scripts/linear_models_ex_05.py +++ /dev/null @@ -1,83 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.14.5 -# kernelspec: -# display_name: Python 3 -# name: python3 -# --- - -# %% [markdown] -# # 📝 Exercise M4.05 -# -# In the previous notebook we set `penalty="none"` to disable regularization -# entirely. This parameter can also control the **type** of regularization to -# use, whereas the regularization **strength** is set using the parameter `C`. -# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In -# this exercise, we ask you to train a logistic regression classifier using the -# `penalty="l2"` regularization (which happens to be the default in -# scikit-learn) to find by yourself the effect of the parameter `C`. -# -# We will start by loading the dataset. - -# %% [markdown] -# ```{note} -# If you want a deeper overview regarding this dataset, you can refer to the -# Appendix - Datasets description section at the end of this MOOC. -# ``` - -# %% -import pandas as pd - -penguins = pd.read_csv("../datasets/penguins_classification.csv") -# only keep the Adelie and Chinstrap classes -penguins = ( - penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index() -) - -culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"] -target_column = "Species" - -# %% -from sklearn.model_selection import train_test_split - -penguins_train, penguins_test = train_test_split(penguins, random_state=0) - -data_train = penguins_train[culmen_columns] -data_test = penguins_test[culmen_columns] - -target_train = penguins_train[target_column] -target_test = penguins_test[target_column] - -# %% [markdown] -# First, let's create our predictive model. - -# %% -from sklearn.pipeline import make_pipeline -from sklearn.preprocessing import StandardScaler -from sklearn.linear_model import LogisticRegression - -logistic_regression = make_pipeline( - StandardScaler(), LogisticRegression(penalty="l2") -) - -# %% [markdown] -# Given the following candidates for the `C` parameter, find out the impact of -# `C` on the classifier decision boundary. You can use -# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the -# decision function boundary. - -# %% -Cs = [0.01, 0.1, 1, 10] - -# Write your code here. - -# %% [markdown] -# Look at the impact of the `C` hyperparameter on the magnitude of the weights. - -# %% -# Write your code here. diff --git a/python_scripts/linear_models_sol_02.py b/python_scripts/linear_models_sol_02.py index d62a4b983..3abc476da 100644 --- a/python_scripts/linear_models_sol_02.py +++ b/python_scripts/linear_models_sol_02.py @@ -8,123 +8,127 @@ # %% [markdown] # # 📃 Solution for Exercise M4.02 # -# The goal of this exercise is to build an intuition on what will be the -# parameters' values of a linear model when the link between the data and the -# target is non-linear. +# In the previous notebook, we showed that we can add new features based on the +# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`. +# In that case we only used a single feature in `data`. # -# First, we will generate such non-linear data. +# The aim of this notebook is to train a linear regression algorithm on a +# dataset with more than a single feature. 
In such a "multi-dimensional" feature +# space we can derive new features of the form `x1 * x2`, `x2 * x3`, +# etc. Products of features are usually called "non-linear or +# multiplicative interactions" between features. # -# ```{tip} -# `np.random.RandomState` allows to create a random number generator which can -# be later used to get deterministic results. -# ``` - -# %% -import numpy as np - -# Set the seed for reproduction -rng = np.random.RandomState(0) - -# Generate data -n_sample = 100 -data_max, data_min = 1.4, -1.4 -len_data = data_max - data_min -data = rng.rand(n_sample) * len_data - len_data / 2 -noise = rng.randn(n_sample) * 0.3 -target = data**3 - 0.5 * data**2 + noise +# Feature engineering can be an important step of a model pipeline as long as +# the new features are expected to be predictive. For instance, think of a +# classification model to decide if a patient is at risk of developing a heart +# disease. This would depend on the patient's Body Mass Index which is defined +# as `weight / height ** 2`. +# +# We load the penguins dataset. We first use a set of 3 numerical +# features to predict the target, i.e. the body mass of the penguin. # %% [markdown] # ```{note} -# To ease the plotting, we will create a Pandas dataframe containing the data -# and target +# If you want a deeper overview regarding this dataset, you can refer to the +# Appendix - Datasets description section at the end of this MOOC. # ``` # %% import pandas as pd -full_data = pd.DataFrame({"data": data, "target": target}) - -# %% -import seaborn as sns - -_ = sns.scatterplot( - data=full_data, x="data", y="target", color="black", alpha=0.5 -) +penguins = pd.read_csv("../datasets/penguins.csv") -# %% [markdown] -# We observe that the link between the data `data` and vector `target` is -# non-linear. For instance, `data` could represent the years of experience -# (normalized) and `target` the salary (normalized). Therefore, the problem here -# would be to infer the salary given the years of experience. -# -# Using the function `f` defined below, find both the `weight` and the -# `intercept` that you think will lead to a good linear model. Plot both the -# data and the predictions of this model. +columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"] +target_name = "Body Mass (g)" +# Remove lines with missing values for the columns of interest +penguins_non_missing = penguins[columns + [target_name]].dropna() -# %% -def f(data, weight=0, intercept=0): - target_predict = weight * data + intercept - return target_predict +data = penguins_non_missing[columns] +target = penguins_non_missing[target_name] +data.head() +# %% [markdown] +# Now it is your turn to train a linear regression model on this dataset. First, +# create a linear regression model. # %% # solution -predictions = f(data, weight=1.2, intercept=-0.2) +from sklearn.linear_model import LinearRegression -# %% tags=["solution"] -ax = sns.scatterplot( - data=full_data, x="data", y="target", color="black", alpha=0.5 -) -_ = ax.plot(data, predictions) +linear_regression = LinearRegression() # %% [markdown] -# Compute the mean squared error for this model +# Execute a cross-validation with 10 folds and use the mean absolute error (MAE) +# as the metric.
# %% # solution -from sklearn.metrics import mean_squared_error - -error = mean_squared_error(target, f(data, weight=1.2, intercept=-0.2)) -print(f"The MSE is {error}") +from sklearn.model_selection import cross_validate + +cv_results = cross_validate( + linear_regression, + data, + target, + cv=10, + scoring="neg_mean_absolute_error", + n_jobs=2, +) # %% [markdown] -# Train a linear regression model on this dataset. -# -# ```{warning} -# In scikit-learn, by convention `data` (also called `X` in the scikit-learn -# documentation) should be a 2D matrix of shape `(n_samples, n_features)`. -# If `data` is a 1D vector, you need to reshape it into a matrix with a -# single column if the vector represents a feature or a single row if the -# vector represents a sample. -# ``` +# Compute the mean and std of the MAE in grams (g). # %% -from sklearn.linear_model import LinearRegression - # solution -linear_regression = LinearRegression() -data_2d = data.reshape(-1, 1) -linear_regression.fit(data_2d, target) +print( + "Mean absolute error on testing set with original features: " + f"{-cv_results['test_score'].mean():.3f} ± " + f"{cv_results['test_score'].std():.3f} g" +) # %% [markdown] -# Compute predictions from the linear regression model and plot both the data -# and the predictions. +# Now create a pipeline using `make_pipeline` consisting of a +# `PolynomialFeatures` and a linear regression. Set `degree=2` and +# `interaction_only=True` for the feature engineering step. Remember not to +# include the bias to avoid redundancies with the linear regression's intercept. +# +# Use the same strategy as before to cross-validate such a pipeline. # %% # solution -predictions = linear_regression.predict(data_2d) +from sklearn.preprocessing import PolynomialFeatures +from sklearn.pipeline import make_pipeline -# %% tags=["solution"] -ax = sns.scatterplot( - data=full_data, x="data", y="target", color="black", alpha=0.5 +poly_features = PolynomialFeatures( + degree=2, include_bias=False, interaction_only=True +) +linear_regression_interactions = make_pipeline( + poly_features, linear_regression +) + +cv_results = cross_validate( + linear_regression_interactions, + data, + target, + cv=10, + scoring="neg_mean_absolute_error", + n_jobs=2, ) -_ = ax.plot(data, predictions) # %% [markdown] -# Compute the mean squared error +# Compute the mean and std of the MAE in grams (g) and compare with the results +# without feature engineering. # %% # solution -error = mean_squared_error(target, predictions) -print(f"The MSE is {error}") +print( + "Mean absolute error on testing set with interactions: " + f"{-cv_results['test_score'].mean():.3f} ± " + f"{cv_results['test_score'].std():.3f} g" +) + +# %% [markdown] tags=["solution"] +# We observe that the mean absolute error is lower and less spread with the +# enriched features. In this case the "interactions" are indeed predictive. In +# the following notebook we will see what happens when the enriched features are +# non-predictive and how to deal with this case. diff --git a/python_scripts/linear_models_sol_03.py b/python_scripts/linear_models_sol_03.py index 0cacfcf0d..d789c8522 100644 --- a/python_scripts/linear_models_sol_03.py +++ b/python_scripts/linear_models_sol_03.py @@ -8,24 +8,14 @@ # %% [markdown] # # 📃 Solution for Exercise M4.03 # -# In the previous notebook, we showed that we can add new features based on the -# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`. -# In that case we only used a single feature in `data`.
+# The parameter `penalty` can control the **type** of regularization to use, +# whereas the regularization **strength** is set using the parameter `C`. +# Setting `penalty="none"` is equivalent to an infinitely large value of `C`. In +# this exercise, we ask you to train a logistic regression classifier using the +# `penalty="l2"` regularization (which happens to be the default in +# scikit-learn) to find by yourself the effect of the parameter `C`. # -# The aim of this notebook is to train a linear regression algorithm on a -# dataset with more than a single feature. In such a "multi-dimensional" feature -# space we can derive new features of the form `x1 * x2`, `x2 * x3`, -# etc. Products of features are usually called "non-linear or -# multiplicative interactions" between features. -# -# Feature engineering can be an important step of a model pipeline as long as -# the new features are expected to be predictive. For instance, think of a -# classification model to decide if a patient has risk of developing a heart -# disease. This would depend on the patient's Body Mass Index which is defined -# as `weight / height ** 2`. -# -# We load the dataset penguins dataset. We first use a set of 3 numerical -# features to predict the target, i.e. the body mass of the penguin. +# We start by loading the dataset. # %% [markdown] # ```{note} @@ -36,99 +26,97 @@ # %% import pandas as pd -penguins = pd.read_csv("../datasets/penguins.csv") - -columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"] -target_name = "Body Mass (g)" - -# Remove lines with missing values for the columns of interest -penguins_non_missing = penguins[columns + [target_name]].dropna() - -data = penguins_non_missing[columns] -target = penguins_non_missing[target_name] -data.head() +penguins = pd.read_csv("../datasets/penguins_classification.csv") +# only keep the Adelie and Chinstrap classes +penguins = ( + penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index() +) -# %% [markdown] -# Now it is your turn to train a linear regression model on this dataset. First, -# create a linear regression model. +culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"] +target_column = "Species" # %% -# solution -from sklearn.linear_model import LinearRegression +from sklearn.model_selection import train_test_split -linear_regression = LinearRegression() +penguins_train, penguins_test = train_test_split(penguins, random_state=0) -# %% [markdown] -# Execute a cross-validation with 10 folds and use the mean absolute error (MAE) -# as metric. +data_train = penguins_train[culmen_columns] +data_test = penguins_test[culmen_columns] -# %% -# solution -from sklearn.model_selection import cross_validate - -cv_results = cross_validate( - linear_regression, - data, - target, - cv=10, - scoring="neg_mean_absolute_error", - n_jobs=2, -) +target_train = penguins_train[target_column] +target_test = penguins_test[target_column] # %% [markdown] -# Compute the mean and std of the MAE in grams (g). +# First, let's create our predictive model.
# %% -# solution -print( - "Mean absolute error on testing set with original features: " - f"{-cv_results['test_score'].mean():.3f} ± " - f"{cv_results['test_score'].std():.3f} g" +from sklearn.pipeline import make_pipeline +from sklearn.preprocessing import StandardScaler +from sklearn.linear_model import LogisticRegression + +logistic_regression = make_pipeline( + StandardScaler(), LogisticRegression(penalty="l2") ) # %% [markdown] -# Now create a pipeline using `make_pipeline` consisting of a -# `PolynomialFeatures` and a linear regression. Set `degree=2` and -# `interaction_only=True` to the feature engineering step. Remember not to -# include the bias to avoid redundancies with the linear's regression intercept. -# -# Use the same strategy as before to cross-validate such a pipeline. +# Given the following candidates for the `C` parameter, find out the impact of +# `C` on the classifier decision boundary. You can use +# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the +# decision function boundary. # %% -# solution -from sklearn.preprocessing import PolynomialFeatures -from sklearn.pipeline import make_pipeline - -poly_features = PolynomialFeatures( - degree=2, include_bias=False, interaction_only=True -) -linear_regression_interactions = make_pipeline( - poly_features, linear_regression -) +Cs = [0.01, 0.1, 1, 10] -cv_results = cross_validate( - linear_regression_interactions, - data, - target, - cv=10, - scoring="neg_mean_absolute_error", - n_jobs=2, -) +# solution +import matplotlib.pyplot as plt +import seaborn as sns +from sklearn.inspection import DecisionBoundaryDisplay + +for C in Cs: + logistic_regression.set_params(logisticregression__C=C) + logistic_regression.fit(data_train, target_train) + accuracy = logistic_regression.score(data_test, target_test) + + DecisionBoundaryDisplay.from_estimator( + logistic_regression, + data_test, + response_method="predict", + cmap="RdBu_r", + alpha=0.5, + ) + sns.scatterplot( + data=penguins_test, + x=culmen_columns[0], + y=culmen_columns[1], + hue=target_column, + palette=["tab:red", "tab:blue"], + ) + plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left") + plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}") # %% [markdown] -# Compute the mean and std of the MAE in grams (g) and compare with the results -# without feature engineering. +# Look at the impact of the `C` hyperparameter on the magnitude of the weights. # %% # solution -print( - "Mean absolute error on testing set with interactions: " - f"{-cv_results['test_score'].mean():.3f} ± " - f"{cv_results['test_score'].std():.3f} g" -) +weights_ridge = [] +for C in Cs: + logistic_regression.set_params(logisticregression__C=C) + logistic_regression.fit(data_train, target_train) + coefs = logistic_regression[-1].coef_[0] + weights_ridge.append(pd.Series(coefs, index=culmen_columns)) + +# %% tags=["solution"] +weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f"C: {C}" for C in Cs]) +weights_ridge.plot.barh() +_ = plt.title("LogisticRegression weights depending on C") # %% [markdown] tags=["solution"] -# We observe that the mean absolute error is lower and less spread with the -# enriched features. In this case the "interactions" are indeed predictive. In -# the following notebook we will see what happens when the enriched features are -# non-predictive and how to deal with this case. +# We see that a small `C` will shrink the weight values toward zero. It means +# that a small `C` provides a more regularized model.
Thus, `C` is the inverse +# of the `alpha` coefficient in the `Ridge` model. +# +# Besides, with a strong penalty (i.e. small `C` value), the weight of the +# feature "Culmen Depth (mm)" is almost zero. It explains why the decision +# separation in the plot is almost perpendicular to the "Culmen Length (mm)" +# feature. diff --git a/python_scripts/linear_models_sol_04.py b/python_scripts/linear_models_sol_04.py deleted file mode 100644 index a759c3d24..000000000 --- a/python_scripts/linear_models_sol_04.py +++ /dev/null @@ -1,269 +0,0 @@ -# --- -# jupyter: -# kernelspec: -# display_name: Python 3 -# name: python3 -# --- - -# %% [markdown] -# # 📃 Solution for Exercise M4.04 -# -# In the previous notebook, we saw the effect of applying some regularization on -# the coefficient of a linear model. -# -# In this exercise, we will study the advantage of using some regularization -# when dealing with correlated features. -# -# We will first create a regression dataset. This dataset will contain 2,000 -# samples and 5 features from which only 2 features will be informative. - -# %% -from sklearn.datasets import make_regression - -data, target, coef = make_regression( - n_samples=2_000, - n_features=5, - n_informative=2, - shuffle=False, - coef=True, - random_state=0, - noise=30, -) - -# %% [markdown] -# When creating the dataset, `make_regression` returns the true coefficient used -# to generate the dataset. Let's plot this information. - -# %% -import pandas as pd - -feature_names = [ - "Relevant feature #0", - "Relevant feature #1", - "Noisy feature #0", - "Noisy feature #1", - "Noisy feature #2", -] -coef = pd.Series(coef, index=feature_names) -coef.plot.barh() -coef - -# %% [markdown] -# Create a `LinearRegression` regressor and fit on the entire dataset and check -# the value of the coefficients. Are the coefficients of the linear regressor -# close to the coefficients used to generate the dataset? - -# %% -# solution -from sklearn.linear_model import LinearRegression - -linear_regression = LinearRegression() -linear_regression.fit(data, target) -linear_regression.coef_ - -# %% tags=["solution"] -feature_names = [ - "Relevant feature #0", - "Relevant feature #1", - "Noisy feature #0", - "Noisy feature #1", - "Noisy feature #2", -] -coef = pd.Series(linear_regression.coef_, index=feature_names) -_ = coef.plot.barh() - -# %% [markdown] tags=["solution"] -# We see that the coefficients are close to the coefficients used to generate -# the dataset. The dispersion is indeed cause by the noise injected during the -# dataset generation. - -# %% [markdown] -# Now, create a new dataset that will be the same as `data` with 4 additional -# columns that will repeat twice features 0 and 1. This procedure will create -# perfectly correlated features. - -# %% -# solution -import numpy as np - -data = np.concatenate([data, data[:, [0, 1]], data[:, [0, 1]]], axis=1) - -# %% [markdown] -# Fit again the linear regressor on this new dataset and check the coefficients. -# What do you observe? 
- -# %% -# solution -linear_regression = LinearRegression() -linear_regression.fit(data, target) -linear_regression.coef_ - -# %% tags=["solution"] -feature_names = [ - "Relevant feature #0", - "Relevant feature #1", - "Noisy feature #0", - "Noisy feature #1", - "Noisy feature #2", - "First repetition of feature #0", - "First repetition of feature #1", - "Second repetition of feature #0", - "Second repetition of feature #1", -] -coef = pd.Series(linear_regression.coef_, index=feature_names) -_ = coef.plot.barh() - -# %% [markdown] tags=["solution"] -# We see that the coefficient values are far from what one could expect. By -# repeating the informative features, one would have expected these coefficients -# to be similarly informative. -# -# Instead, we see that some coefficients have a huge norm ~1e14. It indeed means -# that we try to solve an mathematical ill-posed problem. Indeed, finding -# coefficients in a linear regression involves inverting the matrix -# `np.dot(data.T, data)` which is not possible (or lead to high numerical -# errors). - -# %% [markdown] -# Create a ridge regressor and fit on the same dataset. Check the coefficients. -# What do you observe? - -# %% -# solution -from sklearn.linear_model import Ridge - -ridge = Ridge() -ridge.fit(data, target) -ridge.coef_ - -# %% tags=["solution"] -coef = pd.Series(ridge.coef_, index=feature_names) -_ = coef.plot.barh() - -# %% [markdown] tags=["solution"] -# We see that the penalty applied on the weights give a better results: the -# values of the coefficients do not suffer from numerical issues. Indeed, the -# matrix to be inverted internally is `np.dot(data.T, data) + alpha * I`. Adding -# this penalty `alpha` allow the inversion without numerical issue. - -# %% [markdown] -# Can you find the relationship between the ridge coefficients and the original -# coefficients? - -# %% -# solution -ridge.coef_[:5] * 3 - -# %% [markdown] tags=["solution"] -# Repeating three times each informative features induced to divide the ridge -# coefficients by three. - -# %% [markdown] tags=["solution"] -# ```{tip} -# We advise to always use a penalty to shrink the magnitude of the weights -# toward zero (also called "l2 penalty"). In scikit-learn, `LogisticRegression` -# applies such penalty by default. However, one needs to use `Ridge` (and even -# `RidgeCV` to tune the parameter `alpha`) instead of `LinearRegression`. -# -# Other kinds of regularizations exist but will not be covered in this course. -# ``` -# -# ## Dealing with correlation between one-hot encoded features -# -# In this section, we will focus on how to deal with correlated features that -# arise naturally when one-hot encoding categorical features. -# -# Let's first load the Ames housing dataset and take a subset of features that -# are only categorical features. - -# %% tags=["solution"] -import pandas as pd -from sklearn.model_selection import train_test_split - -ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?") -ames_housing = ames_housing.drop(columns="Id") - -categorical_columns = ["Street", "Foundation", "CentralAir", "PavedDrive"] -target_name = "SalePrice" -X, y = ames_housing[categorical_columns], ames_housing[target_name] - -X_train, X_test, y_train, y_test = train_test_split( - X, y, test_size=0.2, random_state=0 -) - -# %% [markdown] tags=["solution"] -# -# We previously presented that a `OneHotEncoder` creates as many columns as -# categories. Therefore, there is always one column (i.e. 
one encoded category) -# that can be inferred from the others. Thus, `OneHotEncoder` creates collinear -# features. -# -# We illustrate this behaviour by considering the "CentralAir" feature that -# contains only two categories: - -# %% tags=["solution"] -X_train["CentralAir"] - -# %% tags=["solution"] -from sklearn.preprocessing import OneHotEncoder - -single_feature = ["CentralAir"] -encoder = OneHotEncoder(sparse_output=False, dtype=np.int32) -X_trans = encoder.fit_transform(X_train[single_feature]) -X_trans = pd.DataFrame( - X_trans, - columns=encoder.get_feature_names_out(input_features=single_feature), -) -X_trans - -# %% [markdown] tags=["solution"] -# -# Here, we see that the encoded category "CentralAir_N" is the opposite of the -# encoded category "CentralAir_Y". Therefore, we observe that using a -# `OneHotEncoder` creates two features having the problematic pattern observed -# earlier in this exercise. Training a linear regression model on such a of -# one-hot encoded binary feature can therefore lead to numerical problems, -# especially without regularization. Furthermore, the two one-hot features are -# redundant as they encode exactly the same information in opposite ways. -# -# Using regularization helps to overcome the numerical issues that we -# highlighted earlier in this exercise. -# -# Another strategy is to arbitrarily drop one of the encoded categories. -# Scikit-learn provides such an option by setting the parameter `drop` in the -# `OneHotEncoder`. This parameter can be set to `first` to always drop the first -# encoded category or `binary_only` to only drop a column in the case of binary -# categories. - -# %% tags=["solution"] -encoder = OneHotEncoder(drop="first", sparse_output=False, dtype=np.int32) -X_trans = encoder.fit_transform(X_train[single_feature]) -X_trans = pd.DataFrame( - X_trans, - columns=encoder.get_feature_names_out(input_features=single_feature), -) -X_trans - -# %% [markdown] tags=["solution"] -# -# We see that only the second column of the previous encoded data is kept. -# Dropping one of the one-hot encoded column is a common practice, especially -# for binary categorical features. Note however that this breaks symmetry -# between categories and impacts the number of coefficients of the model, their -# values, and thus their meaning, especially when applying strong -# regularization. -# -# Let's finally illustrate how to use this option is a machine-learning -# pipeline: - -# %% tags=["solution"] -from sklearn.pipeline import make_pipeline - -model = make_pipeline(OneHotEncoder(drop="first", dtype=np.int32), Ridge()) -model.fit(X_train, y_train) -n_categories = [X_train[col].nunique() for col in X_train.columns] -print(f"R2 score on the testing set: {model.score(X_test, y_test):.2f}") -print( - f"Our model contains {model[-1].coef_.size} features while " - f"{sum(n_categories)} categories are originally available." -) diff --git a/python_scripts/linear_models_sol_05.py b/python_scripts/linear_models_sol_05.py deleted file mode 100644 index bc4a15df1..000000000 --- a/python_scripts/linear_models_sol_05.py +++ /dev/null @@ -1,123 +0,0 @@ -# --- -# jupyter: -# kernelspec: -# display_name: Python 3 -# name: python3 -# --- - -# %% [markdown] -# # 📃 Solution for Exercise M4.05 -# -# In the previous notebook we set `penalty="none"` to disable regularization -# entirely. This parameter can also control the **type** of regularization to -# use, whereas the regularization **strength** is set using the parameter `C`. 
-# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In -# this exercise, we ask you to train a logistic regression classifier using the -# `penalty="l2"` regularization (which happens to be the default in -# scikit-learn) to find by yourself the effect of the parameter `C`. -# -# We will start by loading the dataset. - -# %% [markdown] -# ```{note} -# If you want a deeper overview regarding this dataset, you can refer to the -# Appendix - Datasets description section at the end of this MOOC. -# ``` - -# %% -import pandas as pd - -penguins = pd.read_csv("../datasets/penguins_classification.csv") -# only keep the Adelie and Chinstrap classes -penguins = ( - penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index() -) - -culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"] -target_column = "Species" - -# %% -from sklearn.model_selection import train_test_split - -penguins_train, penguins_test = train_test_split(penguins, random_state=0) - -data_train = penguins_train[culmen_columns] -data_test = penguins_test[culmen_columns] - -target_train = penguins_train[target_column] -target_test = penguins_test[target_column] - -# %% [markdown] -# First, let's create our predictive model. - -# %% -from sklearn.pipeline import make_pipeline -from sklearn.preprocessing import StandardScaler -from sklearn.linear_model import LogisticRegression - -logistic_regression = make_pipeline( - StandardScaler(), LogisticRegression(penalty="l2") -) - -# %% [markdown] -# Given the following candidates for the `C` parameter, find out the impact of -# `C` on the classifier decision boundary. You can use -# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the -# decision function boundary. - -# %% -Cs = [0.01, 0.1, 1, 10] - -# solution -import matplotlib.pyplot as plt -import seaborn as sns -from sklearn.inspection import DecisionBoundaryDisplay - -for C in Cs: - logistic_regression.set_params(logisticregression__C=C) - logistic_regression.fit(data_train, target_train) - accuracy = logistic_regression.score(data_test, target_test) - - DecisionBoundaryDisplay.from_estimator( - logistic_regression, - data_test, - response_method="predict", - cmap="RdBu_r", - alpha=0.5, - ) - sns.scatterplot( - data=penguins_test, - x=culmen_columns[0], - y=culmen_columns[1], - hue=target_column, - palette=["tab:red", "tab:blue"], - ) - plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left") - plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}") - -# %% [markdown] -# Look at the impact of the `C` hyperparameter on the magnitude of the weights. - -# %% -# solution -weights_ridge = [] -for C in Cs: - logistic_regression.set_params(logisticregression__C=C) - logistic_regression.fit(data_train, target_train) - coefs = logistic_regression[-1].coef_[0] - weights_ridge.append(pd.Series(coefs, index=culmen_columns)) - -# %% tags=["solution"] -weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f"C: {C}" for C in Cs]) -weights_ridge.plot.barh() -_ = plt.title("LogisticRegression weights depending of C") - -# %% [markdown] tags=["solution"] -# We see that a small `C` will shrink the weights values toward zero. It means -# that a small `C` provides a more regularized model. Thus, `C` is the inverse -# of the `alpha` coefficient in the `Ridge` model. -# -# Besides, with a strong penalty (i.e. small `C` value), the weight of the -# feature "Culmen Depth (mm)" is almost zero. 
It explains why the decision -# separation in the plot is almost perpendicular to the "Culmen Length (mm)" -# feature. diff --git a/python_scripts/logistic_regression.py b/python_scripts/logistic_regression.py index 3156ebda0..45487341b 100644 --- a/python_scripts/logistic_regression.py +++ b/python_scripts/logistic_regression.py @@ -78,9 +78,7 @@ from sklearn.preprocessing import StandardScaler from sklearn.linear_model import LogisticRegression -logistic_regression = make_pipeline( - StandardScaler(), LogisticRegression(penalty=None) -) +logistic_regression = make_pipeline(StandardScaler(), LogisticRegression()) logistic_regression.fit(data_train, target_train) accuracy = logistic_regression.score(data_test, target_test) print(f"Accuracy on test set: {accuracy:.3f}") @@ -124,8 +122,7 @@ # %% [markdown] # Thus, we see that our decision function is represented by a line separating -# the 2 classes. We should also note that we did not impose any regularization -# by setting the parameter `penalty` to `'none'`. +# the 2 classes. # # Since the line is oblique, it means that we used a combination of both # features:
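To see what "a combination of both features" means in terms of the fitted model, a minimal sketch, assuming the fitted `logistic_regression` pipeline and the `data_train` dataframe used above, could be:

# %%
# Sketch: retrieve the weights of the LogisticRegression step of the fitted
# pipeline; both weights being non-zero is consistent with the oblique boundary.
import pandas as pd

coefs = pd.Series(logistic_regression[-1].coef_[0], index=data_train.columns)
coefs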