Commit
[ci skip] Rework ordering of linear models module (#701)
Co-authored-by: ArturoAmorQ <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
ogrisel and ArturoAmorQ committed Sep 1, 2023
1 parent e44a0f4 commit 999550c
Showing 219 changed files with 4,667 additions and 12,900 deletions.
Binary files not shown (9 files).
5 changes: 0 additions & 5 deletions _sources/linear_models/linear_models_classification_index.md

This file was deleted.

2 changes: 1 addition & 1 deletion _sources/linear_models/linear_models_non_linear_index.md
@@ -1,4 +1,4 @@
# Modelling non-linear features-target relationships
# Non-linear feature engineering for linear models

```{tableofcontents}
5 changes: 0 additions & 5 deletions _sources/linear_models/linear_models_regression_index.md

This file was deleted.

100 changes: 40 additions & 60 deletions _sources/python_scripts/linear_models_ex_02.py
@@ -14,100 +14,80 @@
# %% [markdown]
# # 📝 Exercise M4.02
#
# The goal of this exercise is to build an intuition on what will be the
# parameters' values of a linear model when the link between the data and the
# target is non-linear.
# In the previous notebook, we showed that we can add new features based on the
# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`.
# In that case we only used a single feature in `data`.
#
# First, we will generate such non-linear data.
# The aim of this notebook is to train a linear regression algorithm on a
# dataset with more than a single feature. In such a "multi-dimensional" feature
# space we can derive new features of the form `x1 * x2`, `x2 * x3`,
# etc. Products of features are usually called "non-linear" or
# "multiplicative" interactions between features.
#
# ```{tip}
# `np.random.RandomState` allows you to create a random number generator that
# can later be used to get deterministic results.
# ```

# %%
import numpy as np

# Set the seed for reproducibility
rng = np.random.RandomState(0)

# Generate data
n_sample = 100
data_max, data_min = 1.4, -1.4
len_data = data_max - data_min
data = rng.rand(n_sample) * len_data - len_data / 2
noise = rng.randn(n_sample) * 0.3
target = data**3 - 0.5 * data**2 + noise
# Feature engineering can be an important step of a model pipeline as long as
# the new features are expected to be predictive. For instance, think of a
# classification model to decide if a patient is at risk of developing a heart
# disease. This would depend on the patient's Body Mass Index, which is defined
# as `weight / height ** 2`.
#
# We load the penguins dataset. We first use a set of 3 numerical
# features to predict the target, i.e. the body mass of the penguin.

# %% [markdown]
# ```{note}
# To ease the plotting, we will create a Pandas dataframe containing the data
# and target.
# If you want a deeper overview regarding this dataset, you can refer to the
# Appendix - Datasets description section at the end of this MOOC.
# ```

# %%
import pandas as pd

full_data = pd.DataFrame({"data": data, "target": target})
penguins = pd.read_csv("../datasets/penguins.csv")

# %%
import seaborn as sns
columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Body Mass (g)"

_ = sns.scatterplot(
data=full_data, x="data", y="target", color="black", alpha=0.5
)
# Remove rows with missing values in the columns of interest
penguins_non_missing = penguins[columns + [target_name]].dropna()

# %% [markdown]
# We observe that the link between the vector `data` and the vector `target` is
# non-linear. For instance, `data` could represent the years of experience
# (normalized) and `target` the salary (normalized). Therefore, the problem here
# would be to infer the salary given the years of experience.
#
# Using the function `f` defined below, find both the `weight` and the
# `intercept` that you think will lead to a good linear model. Plot both the
# data and the predictions of this model.


# %%
def f(data, weight=0, intercept=0):
target_predict = weight * data + intercept
return target_predict
data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]
data.head()

# %% [markdown]
# Now it is your turn to train a linear regression model on this dataset. First,
# create a linear regression model.

# %%
# Write your code here.
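
For reference, a minimal sketch of one possible solution (the variable name `linear_regression` is an illustrative choice, not part of the exercise):

from sklearn.linear_model import LinearRegression

# Create an ordinary least squares model with default parameters.
linear_regression = LinearRegression()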

# %% [markdown]
# Compute the mean squared error for this model.
# Execute a cross-validation with 10 folds and use the mean absolute error (MAE)
# as the metric.

# %%
# Write your code here.
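
A possible sketch, assuming the `linear_regression`, `data` and `target` variables defined above; `scoring="neg_mean_absolute_error"` is scikit-learn's negated-MAE scorer:

from sklearn.model_selection import cross_validate

# 10-fold cross-validation scored with the (negated) mean absolute error.
cv_results = cross_validate(
    linear_regression,
    data,
    target,
    cv=10,
    scoring="neg_mean_absolute_error",
)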

# %% [markdown]
# Train a linear regression model on this dataset.
#
# ```{warning}
# In scikit-learn, by convention `data` (also called `X` in the scikit-learn
# documentation) should be a 2D matrix of shape `(n_samples, n_features)`.
# If `data` is a 1D vector, you need to reshape it into a matrix with a
# single column if the vector represents a feature or a single row if the
# vector represents a sample.
# ```
# Compute the mean and std of the MAE in grams (g).

# %%
from sklearn.linear_model import LinearRegression

# Write your code here.
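
One way to report the scores, assuming the hypothetical `cv_results` from the previous sketch (scikit-learn returns negated errors, hence the sign flip):

# Flip the sign to express the error as a positive MAE in grams.
mae_scores = -cv_results["test_score"]
print(f"MAE: {mae_scores.mean():.1f} g ± {mae_scores.std():.1f} g")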

# %% [markdown]
# Compute predictions from the linear regression model and plot both the data
# and the predictions.
# Now create a pipeline using `make_pipeline` consisting of a
# `PolynomialFeatures` step followed by a linear regression. Set `degree=2` and
# `interaction_only=True` on the feature engineering step. Remember not to
# include the bias to avoid redundancies with the linear regression's intercept.
#
# Use the same strategy as before to cross-validate such a pipeline.

# %%
# Write your code here.
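
A sketch of such a pipeline (names like `poly_regression` are illustrative; `include_bias=False` drops the constant column as requested):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Interaction terms only (x1 * x2, x1 * x3, ...), no bias column.
poly_regression = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)
cv_results_poly = cross_validate(
    poly_regression,
    data,
    target,
    cv=10,
    scoring="neg_mean_absolute_error",
)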

# %% [markdown]
# Compute the mean squared error
# Compute the mean and std of the MAE in grams (g) and compare with the results
# without feature engineering.

# %%
# Write your code here.
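
And the comparison, again assuming the hypothetical variables from the sketches above:

mae_poly = -cv_results_poly["test_score"]
print(f"MAE without interactions: {mae_scores.mean():.1f} ± {mae_scores.std():.1f} g")
print(f"MAE with interactions:    {mae_poly.mean():.1f} ± {mae_poly.std():.1f} g")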
81 changes: 35 additions & 46 deletions _sources/python_scripts/linear_models_ex_03.py
@@ -14,24 +14,14 @@
# %% [markdown]
# # 📝 Exercise M4.03
#
# In the previous notebook, we showed that we can add new features based on the
# original feature to make the model more expressive, for instance `x ** 2` or `x ** 3`.
# In that case we only used a single feature in `data`.
# The parameter `penalty` can control the **type** of regularization to use,
# whereas the regularization **strength** is set using the parameter `C`.
# Setting `penalty="none"` is equivalent to an infinitely large value of `C`. In
# this exercise, we ask you to train a logistic regression classifier using
# `penalty="l2"` regularization (which happens to be the default in
# scikit-learn) to find out by yourself the effect of the parameter `C`.
#
# The aim of this notebook is to train a linear regression algorithm on a
# dataset with more than a single feature. In such a "multi-dimensional" feature
# space we can derive new features of the form `x1 * x2`, `x2 * x3`,
# etc. Products of features are usually called "non-linear or
# multiplicative interactions" between features.
#
# Feature engineering can be an important step of a model pipeline as long as
# the new features are expected to be predictive. For instance, think of a
# classification model to decide if a patient has risk of developing a heart
# disease. This would depend on the patient's Body Mass Index which is defined
# as `weight / height ** 2`.
#
# We load the dataset penguins dataset. We first use a set of 3 numerical
# features to predict the target, i.e. the body mass of the penguin.
# We start by loading the dataset.

# %% [markdown]
# ```{note}
@@ -42,52 +32,51 @@
# %%
import pandas as pd

penguins = pd.read_csv("../datasets/penguins.csv")
penguins = pd.read_csv("../datasets/penguins_classification.csv")
# only keep the Adelie and Chinstrap classes
penguins = (
penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
)

columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Body Mass (g)"
culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"

# Remove lines with missing values for the columns of interest
penguins_non_missing = penguins[columns + [target_name]].dropna()
# %%
from sklearn.model_selection import train_test_split

data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]
data.head()
penguins_train, penguins_test = train_test_split(penguins, random_state=0)

# %% [markdown]
# Now it is your turn to train a linear regression model on this dataset. First,
# create a linear regression model.
data_train = penguins_train[culmen_columns]
data_test = penguins_test[culmen_columns]

# %%
# Write your code here.
target_train = penguins_train[target_column]
target_test = penguins_test[target_column]

# %% [markdown]
# Execute a cross-validation with 10 folds and use the mean absolute error (MAE)
# as the metric.
# First, let's create our predictive model.

# %%
# Write your code here.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# %% [markdown]
# Compute the mean and std of the MAE in grams (g).

# %%
# Write your code here.
logistic_regression = make_pipeline(
StandardScaler(), LogisticRegression(penalty="l2")
)

# %% [markdown]
# Now create a pipeline using `make_pipeline` consisting of a
# `PolynomialFeatures` and a linear regression. Set `degree=2` and
# `interaction_only=True` on the feature engineering step. Remember not to
# include the bias to avoid redundancies with the linear regression's intercept.
#
# Use the same strategy as before to cross-validate such a pipeline.
# Given the following candidates for the `C` parameter, find out the impact of
# `C` on the classifier's decision boundary. You can use
# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
# decision boundary.

# %%
Cs = [0.01, 0.1, 1, 10]

# Write your code here.
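
A possible sketch: loop over the candidate values, refit the pipeline and plot one decision boundary per value (the step name `logisticregression` follows `make_pipeline`'s automatic naming; plotting choices are illustrative):

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.inspection import DecisionBoundaryDisplay

for C in Cs:
    # Update the regularization strength and refit on the training set.
    logistic_regression.set_params(logisticregression__C=C)
    logistic_regression.fit(data_train, target_train)

    display = DecisionBoundaryDisplay.from_estimator(
        logistic_regression,
        data_train,
        response_method="predict",
        cmap="RdBu_r",
        alpha=0.5,
    )
    sns.scatterplot(
        data=penguins_train,
        x=culmen_columns[0],
        y=culmen_columns[1],
        hue=target_column,
        ax=display.ax_,
    )
    display.ax_.set_title(f"C = {C}")
plt.show()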

# %% [markdown]
# Compute the mean and std of the MAE in grams (g) and compare with the results
# without feature engineering.
# Look at the impact of the `C` hyperparameter on the magnitude of the weights.

# %%
# Write your code here.
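
One illustrative way to inspect the fitted weights for each candidate value, reusing the hypothetical loop above (`logistic_regression[-1]` accesses the `LogisticRegression` step of the pipeline):

import pandas as pd

weights = {}
for C in Cs:
    logistic_regression.set_params(logisticregression__C=C)
    logistic_regression.fit(data_train, target_train)
    # coef_ has shape (1, n_features) for this binary classification problem.
    weights[C] = pd.Series(
        logistic_regression[-1].coef_.ravel(), index=culmen_columns
    )

# One column per value of C, one row per feature.
print(pd.DataFrame(weights))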
92 changes: 0 additions & 92 deletions _sources/python_scripts/linear_models_ex_04.py

This file was deleted.

