Logistic Regression from Scratch
Implementing logistic regression in Python using the NumPy library
Introduction
Logistic regression is a statistical method for predicting a qualitative response using one or more independent variables. Instead of predicting the response directly, logistic regression models the probability that the response belongs to a particular category. To do this, it uses the logistic function.
The logistic function always produces an S-shaped curve, so the predicted probability $p(X)$ always lies between 0 and 1. The logistic function is also known as the sigmoid function.
$$ \log\Big(\frac{p(X)}{1-p(X)}\Big) = w^TX + c $$
$$z = w^TX + c $$
$$ p(X) = \frac{1}{1 + e^{-z}} $$
#collapse-hide
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
#collapse-hide
def sigmoid(z):
    '''Map a real value z to a probability between 0 and 1'''
    return 1/(1 + np.exp(-z))
x = np.linspace(-10,10,100)
y = sigmoid(x)
plt.plot(x,y)
plt.title("Sigmoid Function")
plt.show()
A method called maximum likelihood is used to estimate the unknown coefficients. The idea behind this method is to choose the coefficients such that the predicted probability $\hat{p}(x_i)$ for each observation corresponds as closely as possible to the observed value of the response variable for that observation. In other words, we estimate the coefficients such that $p(X)$ yields a number close to 1 for observations with actual response value 1 and a number close to 0 for observations with actual response value 0. The mathematical equation for the likelihood function is
$$ \ell(W) = \prod_{i:\, y_i = 1} p(x_i) \prod_{i:\, y_i = 0} \big( 1 - p(x_i)\big) $$
The $W$ estimates are chosen to maximize the likelihood function.
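To make the connection to the loss concrete, here is a minimal sketch with made-up toy labels and probabilities showing that maximizing the likelihood is equivalent to minimizing the average negative log-likelihood, which is exactly the cross entropy derived next.
# Toy example (assumed values, for illustration only)
y_toy = np.array([1, 1, 0, 1, 0])            # observed responses
p_toy = np.array([0.9, 0.8, 0.2, 0.7, 0.1])  # predicted probabilities p(x_i)

# Likelihood: product of p(x_i) where y_i = 1 and (1 - p(x_i)) where y_i = 0
likelihood = np.prod(p_toy[y_toy == 1]) * np.prod(1 - p_toy[y_toy == 0])

# The average negative log-likelihood equals the cross entropy loss
neg_log_likelihood = -np.log(likelihood) / y_toy.size
cross_entropy = -np.mean(y_toy*np.log(p_toy) + (1 - y_toy)*np.log(1 - p_toy))

print(likelihood, neg_log_likelihood, cross_entropy)  # the last two match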
Taking the negative log of the likelihood and averaging over the $N$ observations, we arrive at the cross entropy loss for the binary classification case. \begin{equation} L(y,\hat{y}) = -\frac{1}{N} \sum_{i=1}^N \big[ y_i \log(\hat{y_i}) + (1-y_i)\log(1-\hat{y_i}) \big] \end{equation}
def cross_entropy_loss(w, x, y):
    '''Average cross entropy loss over all observations'''
    z = np.dot(x, w)
    h = sigmoid(z)
    return np.mean(-y*np.log(h) - (1 - y)*np.log(1 - h))
Use the chain rule to calculate the gradient of the loss with respect to the weights:
\begin{equation} \frac{\partial{L}}{\partial{w}} = \frac{\partial{L}}{\partial{\hat{y}}} \frac{\partial{\hat{y}}}{\partial{z}} \frac{\partial{z}}{\partial{w}} \end{equation}
Examining each factor in turn for a single observation $i$:
\begin{equation} \frac{\partial{L}}{\partial{\hat{y_i}}} = \frac{-y_i}{\hat{y_i}} + \frac{1-y_i}{1-\hat{y_i}} = \frac{\hat{y_i}-y_i}{\hat{y_i}(1-\hat{y_i})} \end{equation}
\begin{equation} \frac{\partial{\hat{y_i}}}{\partial{z_i}} = \hat{y_i}(1-\hat{y_i}) \end{equation}
\begin{equation} \frac{\partial{z_i}}{\partial{w}} = x_i \end{equation}
Multiplying all the individual terms, the contribution of observation $i$ to the gradient is
\begin{equation} \frac{\partial{L_i}}{\partial{w}} = (\hat{y_i}-y_i)x_i \end{equation}
Summing these contributions over all observations gives the gradient computed below.
def gradient(w, x, y):
    '''Gradient of the cross entropy loss with respect to the weights'''
    z = np.dot(x, w)
    h = sigmoid(z)
    # The 1/N factor from the average loss is omitted here; it only rescales
    # the gradient and is absorbed by the learning rate.
    return np.dot(x.T, h - y)
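As a sanity check on this derivation, here is a minimal sketch using small made-up toy arrays that compares the analytic gradient above against a numerical finite-difference gradient of cross_entropy_loss.
# Numerical check of the analytic gradient (toy data, for illustration only)
rng = np.random.RandomState(0)
x_toy = rng.rand(5, 3)
y_toy = np.array([1, 0, 1, 1, 0])
w_toy = rng.rand(3)

eps = 1e-6
numerical = np.zeros_like(w_toy)
for j in range(w_toy.size):
    w_plus, w_minus = w_toy.copy(), w_toy.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    numerical[j] = (cross_entropy_loss(w_plus, x_toy, y_toy)
                    - cross_entropy_loss(w_minus, x_toy, y_toy)) / (2 * eps)

# Divide by N because gradient() omits the 1/N factor of the average loss
analytic = gradient(w_toy, x_toy, y_toy) / y_toy.size
print(np.allclose(numerical, analytic, atol=1e-6))  # expect True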
We will use an algorithm called batch gradient descent to find the optimal weights.
$$ \Delta w_{i} = -\eta \frac{\partial L}{\partial w_{i}} $$
$$ w_{i} := w_{i} + \Delta w_{i} $$
def update_weights(x, y, learning_rate=0.01, n_iters=1000, loss_threshold=0.001):
    '''Batch gradient descent on the cross entropy loss'''
    # Initialize the weights with random values
    w = np.random.rand(x.shape[1])
    for epoch in range(n_iters):
        loss = cross_entropy_loss(w, x, y)
        grad = gradient(w, x, y)
        w = w - learning_rate * grad
        if epoch % 1000 == 0:
            print(f"Loss after iteration {epoch} is {round(loss, 2)}")
        if loss < loss_threshold:
            break
    return w
def accuracy(true, probs, threshold = 0.5):
    '''Classification accuracy (%) at the given probability threshold'''
    predicted = (probs > threshold).astype(int)
    return 100*(predicted == true).mean()
def predict(w, x):
    '''Predicted probability of the positive class'''
    return sigmoid(np.dot(x, w))
Let's load a binary classification dataset from sklearn.datasets and split it into training and test sets.
cancer_data = load_breast_cancer()
x, x_test, y, y_test = train_test_split(cancer_data.data,
                                        cancer_data.target,
                                        test_size=0.25,
                                        random_state=0)
# Scale each training sample to unit L2 norm
x = normalize(x)
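Note that sklearn's normalize rescales each sample (row) to unit L2 norm by default, rather than standardizing each feature; a quick check:
# Every row of the normalized training matrix has unit L2 norm
print(np.allclose(np.linalg.norm(x, axis=1), 1.0))  # expect True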
Let's call the update_weights function to find the weights that minimize the cross entropy loss.
w = update_weights(x, y, learning_rate=0.05, n_iters=10000, loss_threshold=0.001)
Let's use the model to generate predictions for the test data and see what accuracy we reach.
preds = predict(w,normalize(x_test))
acc = accuracy(y_test, preds)
print(f"Accuracy on test data is {round(acc,3)}")