This method, batch gradient descent, looks at every example in the entire training set on every step, which is costly when m is large.
Stochastic gradient descent works very well. The sum above is "replaced" by a loop over the training examples, so that the update becomes:
for $i = 1$ to $m$:
$$\theta_j := \theta_j + \alpha\,\big(y^{(i)} - \theta^T x^{(i)}\big)\,x_j^{(i)}$$
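Here is a minimal NumPy sketch of one pass of this stochastic update; the synthetic data, learning rate, and initialization below are made up for illustration:

```python
import numpy as np

# Hypothetical synthetic data: m examples, d features (bias column included).
rng = np.random.default_rng(0)
m, d = 100, 3
X = np.c_[np.ones(m), rng.normal(size=(m, d - 1))]   # each row is x^(i) in R^d
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=m)         # targets y^(i)

theta = np.zeros(d)   # initial parameters
alpha = 0.01          # learning rate (assumed value)

# One pass of stochastic gradient descent: loop over the m examples and apply
# theta_j := theta_j + alpha * (y^(i) - theta^T x^(i)) * x_j^(i).
for i in range(m):
    error = y[i] - theta @ X[i]
    theta += alpha * error * X[i]

print(theta)  # theta drifts toward theta_true; additional passes refine it
```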
Linear regression: recall that under mild assumptions, the ordinary least squares solution can be written explicitly as:
$$\theta^* = (X^T X)^{-1} X^T Y$$
where the linear model is written in matrix form $Y = X\theta + \epsilon$, with $Y = (y^{(1)}, \ldots, y^{(m)}) \in \mathbb{R}^m$ and the design matrix $X \in \mathbb{R}^{m \times d}$.
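A short sketch computing this closed-form solution in NumPy (the data below is made up); solving the normal equations with np.linalg.solve, or using np.linalg.lstsq, is preferable to forming the inverse explicitly:

```python
import numpy as np

# Hypothetical data: X is the m x d design matrix, Y the length-m target vector.
rng = np.random.default_rng(0)
m, d = 100, 3
X = rng.normal(size=(m, d))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)

# theta* = (X^T X)^{-1} X^T Y, computed by solving the normal equations
# rather than inverting X^T X (better numerical behavior).
theta_star = np.linalg.solve(X.T @ X, X.T @ Y)

# np.linalg.lstsq returns the same least-squares solution.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```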
In the context of linear regression, the log-likelihood is often associated with the assumption of normally distributed errors. The typical formulation assumes that the response variable follows a normal distribution with a mean determined by the linear regression model. Here's how you can express the log-likelihood for a simple linear regression model:
Assuming the response variable $y_i$ for each observation is normally distributed with mean $\mu_i$ and constant variance $\sigma^2$, the likelihood function for the observed data $y_i$ given the linear regression model is:
$$L(\beta_0, \beta_1, \sigma^2 \mid x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right)$$
where $\mu_i = \beta_0 + \beta_1 x_i$ is the mean predicted by the linear regression model for the $i$-th observation.
The log-likelihood for the entire dataset $\{y_1, y_2, \ldots, y_n\}$ is the sum of the log-likelihood contributions from each observation:
$$\ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu_i)^2$$
where $\beta_0$ and $\beta_1$ are the coefficients of the linear regression model and $n$ is the number of observations.
The goal in linear regression is often to maximize this log-likelihood function, which is equivalent to minimizing the sum of squared residuals (ordinary least squares approach).
Note that in practice, maximizing the log-likelihood is often done under the assumption that the errors $y_i - \mu_i$ are normally distributed, which allows for the use of maximum likelihood estimation (MLE). This assumption is a key aspect of classical linear regression.
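As a small numerical illustration (with made-up data and a hypothetical gaussian_log_likelihood helper), the sketch below evaluates this log-likelihood for two candidate coefficient pairs; for a fixed $\sigma^2$, the pair with the smaller sum of squared residuals gets the larger log-likelihood:

```python
import numpy as np

def gaussian_log_likelihood(beta0, beta1, sigma2, x, y):
    """Log-likelihood of simple linear regression with Gaussian errors."""
    mu = beta0 + beta1 * x                      # mu_i = beta0 + beta1 * x_i
    n = len(y)
    return (-n / 2 * np.log(2 * np.pi * sigma2)
            - np.sum((y - mu) ** 2) / (2 * sigma2))

# Hypothetical data generated with beta0 = 2, beta1 = 3.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)

# For fixed sigma^2, the coefficients that maximize the log-likelihood are
# exactly those that minimize the sum of squared residuals (OLS).
print(gaussian_log_likelihood(2.0, 3.0, 1.0, x, y))   # near the truth: higher
print(gaussian_log_likelihood(0.0, 0.0, 1.0, x, y))   # far from it: much lower
```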
Now let's take a look at logistic regression.
A natural (generalized) linear model for binary classification:
$$p_\theta(y=1 \mid x) = \sigma(\theta^T x), \qquad p_\theta(y=0 \mid x) = 1 - \sigma(\theta^T x)$$
where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function (or logistic function).
The compact formula is $p_\theta(y \mid x) = \sigma(\theta^T x)^{y}\,\big(1 - \sigma(\theta^T x)\big)^{1-y}$.
Logistic Regression: the likelihood of the training set is
$$L(\theta) = \prod_{i=1}^{m} p_\theta\big(y^{(i)} \mid x^{(i)}\big) = \prod_{i=1}^{m} \sigma(\theta^T x^{(i)})^{y^{(i)}}\,\big(1 - \sigma(\theta^T x^{(i)})\big)^{1 - y^{(i)}}$$
There is no closed-form formula for $\arg\max_\theta L(\theta)$, so we now need to use iterative algorithms. The gradient of the log-likelihood is:
$$\frac{\partial \ell(\theta)}{\partial \theta_j} = \sum_{i=1}^{m}\big(y^{(i)} - \sigma(\theta^T x^{(i)})\big)\,x_j^{(i)}$$
where we used the fact that $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$.
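A minimal NumPy sketch of gradient ascent on the log-likelihood using this gradient; the data, learning rate, and iteration count are made-up choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical binary classification data: m examples, d features.
rng = np.random.default_rng(0)
m, d = 200, 3
X = rng.normal(size=(m, d))
y = (X @ np.array([2.0, -1.0, 0.5]) + 0.3 * rng.normal(size=m) > 0).astype(float)

theta = np.zeros(d)
alpha = 0.1  # learning rate (assumed value)

# Gradient ascent: d l(theta) / d theta_j = sum_i (y^(i) - sigma(theta^T x^(i))) x_j^(i)
for _ in range(1000):
    grad = X.T @ (y - sigmoid(X @ theta))
    theta += alpha * grad / m   # extra 1/m scaling keeps the step size stable
```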
Now we will break down the binary cross-entropy loss function. torch.nn.BCELoss computes the binary cross entropy between the target $y = (y^{(1)}, \ldots, y^{(m)}) \in \{0,1\}^m$ and the output $z = (z^{(1)}, \ldots, z^{(m)}) \in [0,1]^m$ as follows:
$$\mathrm{loss}(z, y) = -\frac{1}{m}\sum_{i=1}^{m}\Big(y^{(i)}\log z^{(i)} + \big(1 - y^{(i)}\big)\log\big(1 - z^{(i)}\big)\Big)$$
Note the default 1/m, where m is typically the size of the batch. This factor will be directly multiplied with the learning rate. Recall the batch gradient descent update:
$$\theta_j := \theta_j - \alpha\,\frac{\partial}{\partial \theta_j}\mathrm{loss}(\theta)$$
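A short PyTorch check of the formula above; with its default reduction='mean', torch.nn.BCELoss averages over the batch, which is where the 1/m factor comes from (the tensors below are made up):

```python
import torch

# Hypothetical predictions z in [0, 1] and binary targets y.
z = torch.tensor([0.9, 0.2, 0.7, 0.4])
y = torch.tensor([1.0, 0.0, 1.0, 1.0])

loss_fn = torch.nn.BCELoss()        # default reduction='mean' -> the 1/m factor
loss = loss_fn(z, y)

# Manual computation of the same quantity.
manual = -(y * torch.log(z) + (1 - y) * torch.log(1 - z)).mean()

print(loss.item(), manual.item())   # the two values agree
```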
Softmax Regression: now we have $c$ classes and for a training example $(x, y)$, the quantity $\theta_k^T x$ should be related to the probability that the target $y$ belongs to class $k$. By analogy with the binary case, we assume:
$$\log p_\theta(y=k \mid x) \approx \theta_k^T x, \quad \text{for all } k = 1, \ldots, c.$$
As a consequence, with $\theta = (\theta_1, \ldots, \theta_c) \in \mathbb{R}^{d \times c}$, we have:
$$p_\theta(y=k \mid x) = \frac{e^{\theta_k^T x}}{\sum_{\ell} e^{\theta_\ell^T x}}$$
and we can write it in vector form:
$$\big(p_\theta(y=k \mid x)\big)_{k=1,\ldots,c} = \mathrm{softmax}\big(\theta_1^T x, \ldots, \theta_c^T x\big)$$
where the softmax turns the vector of scores into a probability vector: the exponential is applied componentwise and the result is normalized to sum to 1.
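A small NumPy sketch of this softmax map; the max-shift is a numerical-stability detail not part of the derivation, and the parameters below are made up:

```python
import numpy as np

def softmax(scores):
    """Map a vector of scores (theta_1^T x, ..., theta_c^T x) to probabilities."""
    shifted = scores - np.max(scores)        # stability shift: does not change the result
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical parameters for c = 3 classes and d = 4 features.
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3))              # theta = (theta_1, ..., theta_c) in R^{d x c}
x = rng.normal(size=4)

probs = softmax(theta.T @ x)                 # p_theta(y = k | x) for k = 1, ..., c
print(probs, probs.sum())                    # the probabilities sum to 1
```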
For logistic regression, we had only one parameter vector $\theta$, whereas here, for two classes, we have two parameter vectors: $\theta_0$ and $\theta_1$.
For 2 classes, we recover the logistic regression:
$$p_\theta(y=1 \mid x) = \frac{e^{\theta_1^T x}}{e^{\theta_1^T x} + e^{\theta_0^T x}} = \frac{1}{1 + e^{(\theta_0 - \theta_1)^T x}}$$
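A quick numerical check of this identity (the vectors below are made up), which also shows that the two-class softmax probability is the sigmoid evaluated at $(\theta_1 - \theta_0)^T x$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, theta1 = rng.normal(size=3), rng.normal(size=3)
x = rng.normal(size=3)

# Softmax probability of class 1 with two classes...
p_softmax = np.exp(theta1 @ x) / (np.exp(theta1 @ x) + np.exp(theta0 @ x))

# ...equals the sigmoid of (theta1 - theta0)^T x.
p_sigmoid = 1.0 / (1.0 + np.exp((theta0 - theta1) @ x))

print(np.isclose(p_softmax, p_sigmoid))   # True
```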
Classification and softmax regression:
For the softmax regression, the log-likelihood can be written as: