Loss Functions for Classification

Table of Contents

  1. Minimal Working Examples
    1. BCELoss
    2. NLLLoss and CrossEntropyLoss

General ML Classification Tasks

In this blog, general machine learning classification tasks will be covered.

Some supervised learning basics

The training step essentially boils down to one idea: maximizing the log-likelihood of the training data.

A Probabilistic Model

The dataset is made of $m$ training examples $(x(i), y(i))_{i \in [m]}$, where we assume $y(i) = \theta^T x(i) + \epsilon(i)$ with i.i.d. Gaussian noise $\epsilon(i) \sim \mathcal{N}(0, \sigma^2)$. The log-likelihood is then:

$$\mathcal{L}(\theta \mid x) = \log L(\theta) = -m\log(\sigma \sqrt{2\pi}) - \frac{1}{2\sigma^{2}} \sum_{i=1}^{m} \left(y(i) - \theta^T x(i)\right)^2$$

and the corresponding cost function is:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(y(i) - \theta^T x(i)\right)^2$$

giving rise to the ordinary least squares regression model.

The $j$-th component of the gradient of the least squares cost function is:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \sum_{i=1}^{m} \left(y(i) - \theta^T x(i)\right) \frac{\partial}{\partial \theta_j}\left(y(i) - \sum_{k=1}^{d} \theta_k x_k(i)\right) = -\sum_{i=1}^{m} \left(y(i) - \theta^T x(i)\right) x_j(i)$$
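
As a quick sanity check of this formula, here is a minimal sketch (synthetic data, illustrative names) comparing the analytic gradient against PyTorch's autograd:

import torch

torch.manual_seed(0)
m, d = 50, 3
X = torch.randn(m, d)                      # one training example x(i) per row
y = torch.randn(m)
theta = torch.zeros(d, requires_grad=True)

# J(theta) = 1/2 * sum_i (y(i) - theta^T x(i))^2
J = 0.5 * ((y - X @ theta) ** 2).sum()
J.backward()

# analytic gradient: dJ/dtheta_j = -sum_i (y(i) - theta^T x(i)) x_j(i)
analytic = -(X.T @ (y - X @ theta.detach()))
assert torch.allclose(theta.grad, analytic, atol=1e-5)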

Gradient Descent Algorithms

Batch gradient descent performs the update

$$\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left(y(i) - \theta^T x(i)\right) x_j(i)$$

where $\alpha$ is the learning rate.

This method looks at every example in the entire training set at each update step.
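
Here is a minimal batch gradient descent sketch for this update on synthetic data (the data, learning rate, and iteration count are illustrative assumptions, not from the derivation above):

import torch

torch.manual_seed(0)
m, d = 100, 3
true_theta = torch.tensor([1.0, -2.0, 0.5])
X = torch.randn(m, d)
y = X @ true_theta + 0.1 * torch.randn(m)      # noisy linear data

theta = torch.zeros(d)
alpha = 0.005                                  # learning rate
for _ in range(1000):
    # theta_j := theta_j + alpha * sum_i (y(i) - theta^T x(i)) x_j(i)
    theta = theta + alpha * X.T @ (y - X @ theta)

print(theta)                                   # approximately true_theta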

Stochastic gradient descent works very well in practice: the sum above is replaced by a loop over the training examples, so that the update becomes:

for $i = 1$ to $m$:

$$\theta_j := \theta_j + \alpha \left(y(i) - \theta^T x(i)\right) x_j(i)$$
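
The stochastic variant only changes the loop structure; a minimal sketch (same kind of synthetic data and illustrative hyperparameters as above):

import torch

torch.manual_seed(0)
m, d = 100, 3
X = torch.randn(m, d)
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(m)

theta = torch.zeros(d)
alpha = 0.01
for epoch in range(50):
    for i in range(m):                         # one update per training example
        theta = theta + alpha * (y[i] - theta @ X[i]) * X[i]

print(theta)                                   # hovers near the least squares solution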

Linear regression: recall that, under mild assumptions, the ordinary least squares solution can be written explicitly as:

$$\theta^{*} = (X^T X)^{-1} X^T Y$$

where the linear model is written in matrix form $Y = X\theta + \epsilon$, with $Y = (y(1), \dots, y(m)) \in \mathbb{R}^{m}$ and $X \in \mathbb{R}^{m \times d}$ the design matrix whose $i$-th row is $x(i)^T$.
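
A small sketch (illustrative data, assuming a reasonably recent PyTorch with torch.linalg) checking this closed-form solution against the built-in least squares solver:

import torch

torch.manual_seed(0)
m, d = 100, 3
X = torch.randn(m, d)
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(m)

# theta* = (X^T X)^{-1} X^T Y, computed via a linear solve rather than an explicit inverse
theta_star = torch.linalg.solve(X.T @ X, X.T @ y)

assert torch.allclose(theta_star, torch.linalg.lstsq(X, y.unsqueeze(1)).solution.squeeze(), atol=1e-4)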

In the context of linear regression, the log-likelihood is often associated with the assumption of normally distributed errors. The typical formulation assumes that the response variable follows a normal distribution with a mean determined by the linear regression model. Here's how you can express the log-likelihood for a simple linear regression model:

Assuming the response variable $y_i$ for each observation is normally distributed with mean $\mu_i$ and constant variance $\sigma^2$, the likelihood contribution of the $i$-th observation given the linear regression model is:

$$L(\beta_0, \beta_1, \sigma^2 \mid x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right)$$

where $\mu_i = \beta_0 + \beta_1 x_i$ is the mean predicted by the linear regression model for the $i$-th observation.

The log-likelihood for the entire dataset $\{y_1, y_2, \ldots, y_n\}$ is the sum of the log-likelihood contributions from each observation:

$$\mathcal{L}(\beta_0, \beta_1, \sigma^2 \mid x_1, \dots, x_n) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu_i)^2$$

Here, $n$ is the number of observations, $\mu_i = \beta_0 + \beta_1 x_i$ is the predicted mean for the $i$-th observation, and $\sigma^2$ is the error variance.

The goal in linear regression is often to maximize this log-likelihood function, which is equivalent to minimizing the sum of squared residuals (ordinary least squares approach).

Note that in practice, maximizing the log-likelihood is often done under the assumption that the errors (yiμi)(y_i - \mu_i) are normally distributed, which allows for the use of maximum likelihood estimation (MLE). This assumption is a key aspect of classical linear regression.
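
To make this concrete, here is a small sketch (illustrative data and parameter values) checking the log-likelihood formula above against torch.distributions.Normal:

import math
import torch
from torch.distributions import Normal

torch.manual_seed(0)
n = 50
x = torch.randn(n)
y = 2.0 + 3.0 * x + 0.5 * torch.randn(n)       # data generated with beta0=2, beta1=3, sigma=0.5

beta0, beta1, sigma = 2.0, 3.0, 0.5
mu = beta0 + beta1 * x

# closed-form log-likelihood: -n/2 * log(2*pi*sigma^2) - sum (y_i - mu_i)^2 / (2*sigma^2)
ll_formula = -0.5 * n * torch.log(torch.tensor(2 * math.pi * sigma ** 2)) \
             - ((y - mu) ** 2).sum() / (2 * sigma ** 2)

# the same quantity, summing per-observation Gaussian log densities
ll_normal = Normal(mu, sigma).log_prob(y).sum()
assert torch.allclose(ll_formula, ll_normal, atol=1e-4)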

Now let's take a look at logistic regression.

A natural (generalized) linear model for binary classification:

$$p_{\theta}(y = 1 \mid x) = \sigma(\theta^T x), \qquad p_{\theta}(y = 0 \mid x) = 1 - \sigma(\theta^T x)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function (or logistic function).

insert the graph here TODO

The compact formula is $p_{\theta}(y \mid x) = \sigma(\theta^T x)^{y} \left(1 - \sigma(\theta^T x)\right)^{1-y}$.

Logistic regression: the likelihood of the training set is:

$$L(\theta) = \prod_{i=1}^{m} p_{\theta}(y(i) \mid x(i)) = \prod_{i=1}^{m} \sigma(\theta^T x(i))^{y(i)} \left(1 - \sigma(\theta^T x(i))\right)^{1 - y(i)}$$

There is no closed-form formula for $\arg\max_{\theta} \ell(\theta)$, so we now need to use iterative algorithms. The gradient of the log-likelihood is:

$$\frac{\partial}{\partial \theta_j} \ell(\theta) = \sum_{i=1}^{m} \left(y(i) - \sigma(\theta^T x(i))\right) x_j(i)$$

where we used the fact that $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.
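
A minimal gradient-ascent sketch for logistic regression using this gradient (the synthetic data and hyperparameters are illustrative assumptions):

import torch

torch.manual_seed(0)
m, d = 200, 2
X = torch.randn(m, d)
true_theta = torch.tensor([2.0, -1.0])
y = (torch.rand(m) < torch.sigmoid(X @ true_theta)).float()   # Bernoulli labels

theta = torch.zeros(d)
alpha = 0.1
for _ in range(500):
    # gradient of the log-likelihood: sum_i (y(i) - sigma(theta^T x(i))) x_j(i)
    grad = X.T @ (y - torch.sigmoid(X @ theta))
    theta = theta + alpha * grad / m           # the 1/m only rescales the step size

print(theta)                                   # roughly aligned with true_theta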

Now we will break down the binary cross-entropy loss function. torch.nn.BCELoss computes the binary cross entropy between the target $y = (y(1), \dots, y(m)) \in \{0, 1\}^m$ and the output $z = (z(1), \dots, z(m)) \in [0, 1]^m$ as follows:

$$\text{loss}(i) = -\left[y(i)\log z(i) + (1 - y(i))\log(1 - z(i))\right]$$

$$\text{BCELoss}(z, y) = \frac{1}{m} \sum_{i=1}^{m} \text{loss}(i)$$
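
We can check this definition directly against nn.BCELoss with random data (a minimal sketch):

import torch
import torch.nn as nn

torch.manual_seed(0)
m = 10
z = torch.rand(m).clamp(1e-4, 1 - 1e-4)        # predicted probabilities in (0, 1)
y = torch.randint(0, 2, (m,)).float()          # binary targets

manual = -(y * torch.log(z) + (1 - y) * torch.log(1 - z)).mean()
assert torch.allclose(manual, nn.BCELoss()(z, y))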

If the network outputs raw scores (logits) $z$ instead of probabilities, torch.nn.BCEWithLogitsLoss applies the sigmoid internally:

$$\text{BCEWithLogitsLoss}(z, y) = \text{BCELoss}(\sigma(z), y)$$

This version is more numerically stable.

Note the default $1/m$ factor, where $m$ is typically the batch size. This factor is effectively multiplied with the learning rate. Recall the gradient descent update:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} \text{loss}(\theta)$$
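
The $1/m$ factor above corresponds to BCELoss's default reduction='mean'; switching to reduction='sum' removes it, as this small sketch illustrates:

import torch
import torch.nn as nn

torch.manual_seed(0)
m = 16                                         # batch size
z = torch.rand(m).clamp(1e-4, 1 - 1e-4)        # predicted probabilities
y = torch.randint(0, 2, (m,)).float()

mean_loss = nn.BCELoss(reduction='mean')(z, y)
sum_loss = nn.BCELoss(reduction='sum')(z, y)
assert torch.allclose(mean_loss, sum_loss / m)  # the 1/m ends up scaling the gradient, hence the effective step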

Softmax regression: now we have $c$ classes and, for a training example $(x, y)$, the quantity $\theta_k^T x$ should be related to the probability for the target $y$ to belong to class $k$. By analogy with the binary case, we assume:

$$\log p_{\theta}(y = k \mid x) \approx \theta_k^T x, \quad \text{for all } k = 1, \dots, c.$$

As a consequence, with $\theta = (\theta_1, \dots, \theta_c) \in \mathbb{R}^{d \times c}$, we have:

$$p_{\theta}(y = k \mid x) = \frac{e^{\theta_k^T x}}{\sum_{l=1}^{c} e^{\theta_l^T x}}$$

and we can write it in vector form:

$$\left(p_{\theta}(y = k \mid x)\right)_{k = 1, \dots, c} = \text{softmax}\left(\theta_1^T x, \dots, \theta_c^T x\right)$$

where the softmax function maps the vector of scores $(\theta_1^T x, \dots, \theta_c^T x)$ to a probability vector over the $c$ classes.
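
A quick numerical check of this vector form against torch.softmax (random scores, illustrative shapes):

import torch

torch.manual_seed(0)
c, d = 4, 3
theta = torch.randn(d, c)                      # column k holds theta_k
x = torch.randn(d)

scores = theta.T @ x                           # (theta_1^T x, ..., theta_c^T x)
manual = torch.exp(scores) / torch.exp(scores).sum()
assert torch.allclose(manual, torch.softmax(scores, dim=0))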

For logistic regression, we had only one parameter vector $\theta$, whereas here, for two classes, we have two parameter vectors: $\theta_0$ and $\theta_1$.

For two classes, we recover logistic regression:

$$p_{\theta}(y = 1 \mid x) = \frac{e^{\theta_1^T x}}{e^{\theta_1^T x} + e^{\theta_0^T x}} = \frac{1}{1 + e^{(\theta_0 - \theta_1)^T x}}$$
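
Numerically, the two-class softmax does collapse to the sigmoid (a small sketch with random parameters):

import torch

torch.manual_seed(0)
d = 3
theta0, theta1 = torch.randn(d), torch.randn(d)
x = torch.randn(d)

p1_softmax = torch.softmax(torch.stack([theta0 @ x, theta1 @ x]), dim=0)[1]
p1_sigmoid = torch.sigmoid((theta1 - theta0) @ x)
assert torch.allclose(p1_softmax, p1_sigmoid)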

Classification and softmax regression:

For the softmax regression, the log-likelihood can be written as:

$$\ell(\theta) = \sum_{i=1}^{m} \sum_{k=1}^{c} \mathbb{1}\{y(i) = k\} \log\left(\frac{e^{\theta_k^T x(i)}}{\sum_{l} e^{\theta_l^T x(i)}}\right) = \sum_{i=1}^{m} \log \text{softmax}_{y(i)}\left(\theta_1^T x(i), \dots, \theta_c^T x(i)\right)$$

In PyTorch, if the last layer of your network is a LogSoftmax(), then you can do softmax regression with torch.nn.NLLLoss(). Equivalently, torch.nn.CrossEntropyLoss combines LogSoftmax and NLLLoss in a single module.

Minimal Working Examples

BCELoss

import torch
import torch.nn as nn

m = nn.Sigmoid()
loss = nn.BCELoss()
input = torch.randn(3, 4, 5)     # raw scores (logits)
target = torch.rand(3, 4, 5)     # BCELoss targets must lie in [0, 1]
loss(m(input), target)           # apply the sigmoid so the predictions are in [0, 1]

NLLLoss and CrossEntropyLoss

import torch
import torch.nn as nn

m = nn.LogSoftmax(dim=1)
loss1 = nn.NLLLoss()
loss2 = nn.CrossEntropyLoss()
C = 8                                                            # number of classes
input = torch.randn(3, C, 4, 5)                                  # unnormalized scores (logits)
target = torch.empty(3, 4, 5, dtype=torch.long).random_(0, C)    # class indices in [0, C)
assert torch.allclose(loss1(m(input), target), loss2(input, target))