Consider a simple regression function $y = \beta_0 + \beta_1 x + \epsilon$. The parameters of the function are the slope $\beta_1$ and the intercept $\beta_0$. The errors $\epsilon_i$ associated with each data point are assumed to be independent and normally distributed with variance $\sigma^2$. Given a sample data point $(x_i, y_i)$, the likelihood of the parameters $(\beta_0, \beta_1)$ is specified as

$$L(\beta_0, \beta_1 \mid x_i, y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right).$$

Maximum-likelihood estimation uses the joint likelihood function of all the data points to learn the parameters of the regression line. Stochastic gradient descent instead uses one data point at a time, iteratively updating the estimated parameters by moving in the direction of the negative gradient of that point's negative log-likelihood (equivalently, ascending the gradient of its log-likelihood). The amount of travel in the direction of the point gradient is specified by the learning parameter $\eta$
of the algorithm. The algorithm is as follows.
1. Choose a learning parameter $\eta$ and an initial estimate of the parameters $(\beta_0, \beta_1)$.
2. Produce a random permutation of the data points.
3. For each point $(x_i, y_i)$, update the parameters along the gradient of that point's log-likelihood (absorbing the constant factor $1/\sigma^2$ into $\eta$): $\beta_0 \leftarrow \beta_0 + \eta\,(y_i - \beta_0 - \beta_1 x_i)$ and $\beta_1 \leftarrow \beta_1 + \eta\, x_i\,(y_i - \beta_0 - \beta_1 x_i)$.
4. Repeat steps 2 and 3 until some convergence criterion is met.
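The steps above can be sketched in Python as follows; the function name, learning parameter, epoch count, and synthetic data are illustrative assumptions, not part of the original text:

```python
import random

def sgd_linear_regression(points, eta=0.05, epochs=500):
    """Fit y = b0 + b1*x by per-point stochastic gradient updates
    on the Gaussian log-likelihood (equivalent to SGD on squared error)."""
    b0, b1 = 0.0, 0.0               # step 1: initial parameter estimates
    pts = list(points)
    for _ in range(epochs):         # step 4: repeat until convergence
        random.shuffle(pts)         # step 2: random permutation of the data
        for x, y in pts:            # step 3: per-point gradient update
            residual = y - (b0 + b1 * x)
            b0 += eta * residual        # gradient of log-likelihood w.r.t. b0
            b1 += eta * residual * x    # gradient of log-likelihood w.r.t. b1
    return b0, b1

# Usage: recover the parameters of y = 2 + 3x from noisy synthetic data.
random.seed(0)
data = [(x, 2.0 + 3.0 * x + random.gauss(0, 0.1))
        for x in (i / 10 for i in range(20))]
b0, b1 = sgd_linear_regression(data)
```

A fixed number of epochs stands in for the convergence criterion here; in practice one would stop when the parameter updates (or the held-out log-likelihood) change by less than a small tolerance.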