For the code here, you need a few imports:
```python
import numpy as np
from matplotlib import pyplot as plt
```
Logistic Regression is, despite the name, a classifier. Its procedure fits (hence the reference to regression) the logistic function, or sigmoid; see the plot below.
```python
# Plot the sigmoid (logistic) function
x = np.arange(-10, 10, 0.1)   # a fine step gives a smooth curve
y = 1. / (1 + np.exp(-x))

plt.plot(x, y, color='g', label='$1/(1 + e^{-x})$')
plt.title('A sigmoid function')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.legend()
plt.show()
```
This exposition is inspired by, and re-worked from, the notes of a Stanford course (see the references at the bottom).
The idea is: given a categorical variable $$y \in \{0, 1\}$$ and some independent variable $$x$$ which we want to use in order to classify $$y$$, we could think of running a linear regression $$y = mx + b$$ followed by a classification step (say, if $$mx + b \geq 0.5$$ we classify $$y$$ as 1, and if $$mx + b < 0.5$$ we classify it as 0).
This would work fine if we were in the case displayed in the left panel of the figure. But if, as in the right panel, the training set contains a point far away from the rest, that point skews the fitted line and this procedure ends up misclassifying points as 0!
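To make this concrete, here is a minimal sketch with made-up one-dimensional data (the values are arbitrary, chosen only for illustration): a least-squares line thresholded at 0.5 classifies the clean set perfectly, but adding a single far-away positive example skews the line and a positive point falls below the threshold.

```python
# A minimal sketch (with made-up 1-D data) of why thresholding a linear
# regression is brittle: one far-away positive example skews the fitted line.
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([0., 0., 0., 0., 1., 1., 1., 1.])   # true labels

def fit_and_classify(x, y):
    # least-squares fit y = m*x + b, then threshold at 0.5
    m, b = np.polyfit(x, y, 1)
    return (m * x + b >= 0.5).astype(int)

print(fit_and_classify(x, y))        # [0 0 0 0 1 1 1 1]: matches the labels

# Add a single positive example far to the right: the line flattens and
# a point that used to be classified correctly now falls below 0.5.
x2 = np.append(x, 40.)
y2 = np.append(y, 1.)
print(fit_and_classify(x2, y2)[:8])  # [0 0 0 0 0 1 1 1]: a positive is now a 0
```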
The logistic regression uses a sigmoid function model:

$$h_\theta(x) = g(\bar\theta \cdot \bar x) = \frac{1}{1 + e^{-\bar\theta \cdot \bar x}} \ ,$$

where $$g$$ is the sigmoid plotted above and $$0 \leq h_\theta(x) \leq 1$$ can be interpreted as the probability that $$y=1$$ given $$x$$ as the input and $$\theta$$ the parameters:

$$h_\theta(x) = P(y = 1 \mid x; \theta) \ .$$
A logistic regression is a linear classifier: it predicts $$y=1$$ when

$$h_\theta(x) \geq 0.5 \ ,$$

which, given the shape of the sigmoid, happens exactly when $$\bar\theta \cdot \bar x \geq 0$$. Suppose we find some parameters $$\bar\theta$$ (we are in multiple dimensions); we would then predict $$y=1$$ for every point with $$\bar\theta \cdot \bar x \geq 0$$, so the decision boundary $$\bar\theta \cdot \bar x = 0$$ is a hyperplane (a straight line in two dimensions).
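As a minimal sketch, with parameter values picked purely for illustration (not fitted to any data), the rule "predict $$y=1$$ exactly when $$\bar\theta \cdot \bar x \geq 0$$" reads:

```python
# A minimal sketch of the prediction rule; theta is chosen just for
# illustration, and the first component multiplies a constant-1 intercept feature.
import numpy as np

def sigmoid(z):
    return 1. / (1 + np.exp(-z))

def predict(theta, X):
    # X has one sample per row, with a leading column of ones
    return (X @ theta >= 0).astype(int)   # same as sigmoid(X @ theta) >= 0.5

theta = np.array([-3., 1., 1.])            # illustrative values
X = np.array([[1., 1., 1.],                # theta.x = -1  ->  y = 0
              [1., 2., 2.]])               # theta.x = +1  ->  y = 1
print(predict(theta, X))                   # [0 1]
print(sigmoid(X @ theta))                  # [~0.27, ~0.73]
```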
Note, though, that if the boundary suggested by the training data is not linear, we should use a logistic regression with higher-order polynomial features, as in $$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$$, with $$g$$ being the logistic function (the sigmoid). As in polynomial regression, higher-order features can be treated as first-order ones with a substitution.
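For instance (again with hand-picked, illustrative parameters rather than fitted ones), $$h_\theta(x) = g(-1 + x_1^2 + x_2^2)$$ predicts $$y=1$$ outside the unit circle and $$y=0$$ inside it:

```python
# A minimal sketch of a non-linear decision boundary via polynomial features:
# theta = [-1, 0, 0, 1, 1] gives g(-1 + x1^2 + x2^2), whose boundary is the
# circle x1^2 + x2^2 = 1. The parameter values are illustrative only.
import numpy as np

def sigmoid(z):
    return 1. / (1 + np.exp(-z))

def h(theta, x1, x2):
    features = np.array([1., x1, x2, x1**2, x2**2])   # the substitution
    return sigmoid(theta @ features)

theta = np.array([-1., 0., 0., 1., 1.])
print(h(theta, 0.1, 0.1) >= 0.5)   # False: inside the circle  -> y = 0
print(h(theta, 2.0, 2.0) >= 0.5)   # True:  outside the circle -> y = 1
```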
Now, how do we find the coefficients $$\bar{\theta}$$? We minimise a cost function, as in regular linear regression.
Given a training set with $$m$$ samples in a multi-dimensional space, $$\{(\bar{x}^1, y^1), \ldots, (\bar{x}^m, y^m)\}$$, where each sample has $$n$$ features, so $$\bar x$$ is a vector in $$\mathbb{R}^n$$ and $$y \in \{0, 1\}$$, the model is

$$h_\theta(\bar x) = \frac{1}{1 + e^{-\bar\theta \cdot \bar x}} \ .$$
The cost function minimised in the case of linear regression (the mean squared error) would, with the logistic model plugged in, be a non-convex function with many local minima, so a gradient descent would not be guaranteed to find the global minimum. Instead, the cost function we use is (illustrated in the figure):

$$J(\bar\theta) = \frac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(\bar x^i), y^i) \ , \quad \mathrm{Cost}(h_\theta(\bar x), y) = \begin{cases} -\log(h_\theta(\bar x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(\bar x)) & \text{if } y = 0 \end{cases}$$

This cost function captures the intuition that if $$y = 1$$ but the model outputs $$h_\theta(\bar x) \to 0$$ (and, symmetrically, if $$y = 0$$ but $$h_\theta(\bar x) \to 1$$), the cost blows up: confident wrong predictions are penalised very heavily, while the cost of a correct, confident prediction approaches zero.
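In code, the piecewise definition above is usually written in the equivalent single-line form $$-y \log(h) - (1-y)\log(1-h)$$; here is a minimal sketch on made-up data:

```python
# A minimal sketch of the logistic cost, on toy data chosen for illustration.
import numpy as np

def sigmoid(z):
    return 1. / (1 + np.exp(-z))

def cost(theta, X, y):
    # X: (m, n) samples with a leading column of ones, y: (m,) labels in {0, 1}
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])   # toy samples
y = np.array([0., 0., 1., 1.])
print(cost(np.array([-1.5, 1.]), X, y))   # a decent fit      -> small cost (~0.34)
print(cost(np.array([1.5, -1.]), X, y))   # the reversed fit  -> large cost (~1.34)
```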
At this point, a gradient descent (see the page linked below) is used to compute the minimum of the cost over the parameters $$\bar\theta$$.
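A toy sketch of batch gradient descent on this cost follows; the data, learning rate and iteration count are arbitrary choices for illustration.

```python
# A minimal sketch of batch gradient descent on the cost above.
import numpy as np

def sigmoid(z):
    return 1. / (1 + np.exp(-z))

def gradient_descent(X, y, lr=0.1, n_iter=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)   # gradient of J
        theta -= lr * grad
    return theta

# Toy 1-D data (plus an intercept column of ones): labels flip at x = 0
X = np.array([[1., -3.], [1., -2.], [1., -1.], [1., 1.], [1., 2.], [1., 3.]])
y = np.array([0., 0., 0., 1., 1., 1.])

theta = gradient_descent(X, y)
print(theta)                                    # intercept ~0, positive slope
print((sigmoid(X @ theta) >= 0.5).astype(int))  # [0 0 0 1 1 1], as in y
```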
Logistic Regression is a case of a Generalised Linear Model: the predictor is indeed linear in the input variables, as

$$\log \frac{h_\theta(\bar x)}{1 - h_\theta(\bar x)} = \bar\theta \cdot \bar x \ ,$$

with $$h$$ being interpreted as a probability as explained above (the left-hand side is the log-odds, or logit, of that probability).
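A quick numerical check of this identity, with arbitrary values for $$\bar\theta$$ and $$\bar x$$:

```python
# Check that the log-odds of h equals the linear predictor theta.x,
# using arbitrary illustrative values.
import numpy as np

theta = np.array([-3., 1., 1.])
x = np.array([1., 2., 4.])          # leading 1 is the intercept feature

h = 1. / (1 + np.exp(-theta @ x))
print(np.log(h / (1 - h)))          # ~3.0 (up to floating point)
print(theta @ x)                    # 3.0
```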
{% page-ref page="../../machine-learning-concepts-and-procedures/learning-algorithms/the-gradient-descent-method.md" %}
- D. R. Cox, *The regression analysis of binary sequences*, Journal of the Royal Statistical Society B, 20:2, 1958
- A. Ng, *Notes on linear and logistic regression*, Stanford ML course