Finding a local minimum, possibly using the Wolfram Language FindMinimum function, is a basic numerical problem at the heart of neural network training. Finding a local minimum is usually done by repeatedly calculating the gradient at the current point and stepping against it (gradient descent). The objective is then to optimize convergence speed. Calculated gradients are often inaccurate (simulated here by adding Gaussian noise), which suggests using some averaging procedure (e.g. the momentum method).
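A minimal sketch of this idea in plain Python (not the Demonstration's Wolfram Language source; the function and parameter names here are mine): the gradient is corrupted by Gaussian noise and exponentially averaged before each step.

```python
import random

def noisy_momentum_descent(grad, x0, lr=0.1, beta=0.9,
                           steps=200, noise=0.0, seed=0):
    """Gradient descent on a 2D function with Gaussian-noise-corrupted
    gradients, exponentially averaged (momentum) before each step."""
    rng = random.Random(seed)
    x, y = x0
    mx = my = 0.0  # exponentially averaged gradient
    for _ in range(steps):
        gx, gy = grad(x, y)
        gx += rng.gauss(0.0, noise)  # simulate inaccurate gradients
        gy += rng.gauss(0.0, noise)
        mx = beta * mx + (1 - beta) * gx
        my = beta * my + (1 - beta) * gy
        x -= lr * mx
        y -= lr * my
    return x, y

# Quadratic bowl f(x, y) = x^2 + y^2 with its minimum at (0, 0):
final = noisy_momentum_descent(lambda x, y: (2 * x, 2 * y), (3.0, -2.0))
```

With `noise=0.0` this reduces to deterministic momentum descent; raising `noise` shows why averaging (here `beta`) is needed to keep the trajectory stable.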
This Demonstration allows testing of five methods on five popular two-dimensional functions (with minimum shifted to (0, 0)). SGD stands for stochastic gradient descent and ADAM for Adaptive Moment Estimation.
You can draw trajectories in 2D or 3D and evaluate the methods on a lattice of starting points. The basic standard methods are vanilla SGD and momentum; however, these are very inefficient and require a tiny learning rate. The very popular ADAM method gives much better results.
Two additional methods are also considered under the name OGR (online gradient regression, see arXiv:1901.11457): they estimate second derivatives from the observed sequence of gradients and use them in a Newton-like step. The diagonal-Hessian variant cdOGR models a separate parabola for each canonical direction.

### DETAILS

Use the "mode" buttons to choose a 2D or 3D view of trajectories, or evaluation mode, which shows sorted final values from a lattice of initial points.
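The evaluation mode can be mimicked with a short sketch (again plain Python with names of my choosing, not the Demonstration's code): run a given optimizer from every point of a regular lattice and sort the final function values.

```python
def evaluate_on_lattice(optimize, f, r=2.0, n=5):
    """Run `optimize(start) -> (x, y)` from an n-by-n lattice of starting
    points in [-r, r]^2 and return the sorted final function values."""
    finals = []
    for i in range(n):
        for j in range(n):
            x0 = -r + 2 * r * i / (n - 1)
            y0 = -r + 2 * r * j / (n - 1)
            finals.append(f(*optimize((x0, y0))))
    return sorted(finals)

# Plain gradient descent on the bowl f(x, y) = x^2 + y^2:
def gd(start, lr=0.1, steps=100):
    x, y = start
    for _ in range(steps):
        x -= lr * 2 * x
        y -= lr * 2 * y
    return x, y

curve = evaluate_on_lattice(gd, lambda x, y: x * x + y * y)
```

Plotting the sorted list `curve` gives a method's "evaluation curve": the lower and flatter it is, the more reliably the method converges from varied starting points.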
Use the 2D slider "start" to choose the starting point.
The "steps" slider sets the number of steps used for all methods.
Use the "range" slider to set the range shown in the 2D and 3D plots and the lattice of starting points in evaluation mode. To prevent escape to infinity, positions are constrained to a ball of fixed radius.
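The escape-prevention step amounts to projecting each iterate back onto a ball around the origin; a minimal sketch (the radius value used by the Demonstration is not reproduced here, so `r` is arbitrary):

```python
import math

def clip_to_ball(x, y, r):
    """Project (x, y) onto the closed ball of radius r around the origin."""
    norm = math.hypot(x, y)
    if norm > r:
        return (x * r / norm, y * r / norm)
    return (x, y)

inside = clip_to_ball(1.0, 1.0, 4.0)   # already inside: unchanged
outside = clip_to_ball(3.0, 4.0, 4.0)  # norm 5, rescaled to norm 4
```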
Use the "noise" slider to control the standard deviation of the random Gaussian noise added to the gradient.
Use the "seed" slider to choose the random number generator seed for this noise.
Then you can choose one of five popular 2D test functions from [1], which are shifted to have their zero minimum at (0, 0), marked with a red "X".
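For instance, the Rosenbrock function from [1] has its minimum at (1, 1); a sketch of shifting it so the zero minimum sits at (0, 0) (the Demonstration's exact function list is not reproduced here):

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    """Standard Rosenbrock function, with minimum 0 at (a, a^2)."""
    return (a - x) ** 2 + b * (y - x * x) ** 2

def shifted_rosenbrock(x, y):
    """Rosenbrock shifted so its zero minimum sits at (0, 0)."""
    return rosenbrock(x + 1.0, y + 1.0)
```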
Finally, you can tune the hyperparameters of the methods to investigate the dependence on them. The source code prevents very large steps in ADAM and OGR.
Vanilla SGD uses only the learning rate; the momentum method additionally averages gradients, with a separate adaptation rate controlling the averaging. ADAM uses a learning rate and two adaptation rates (β1 and β2 in the standard notation).
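The standard ADAM update can be sketched as follows (a plain-Python illustration of the textbook algorithm [2], not the Demonstration's source; parameter values here are arbitrary):

```python
import math

def adam(grad, x0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=400):
    """Standard ADAM on a 2D function, with bias-corrected moment
    estimates: m averages gradients, v averages squared gradients."""
    x = list(x0)
    m = [0.0, 0.0]  # first moment (averaged gradient)
    v = [0.0, 0.0]  # second moment (averaged squared gradient)
    for t in range(1, steps + 1):
        g = grad(*x)
        for i in range(2):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            mhat = m[i] / (1 - beta1 ** t)  # bias correction
            vhat = v[i] / (1 - beta2 ** t)
            x[i] -= lr * mhat / (math.sqrt(vhat) + eps)
    return x

# Quadratic bowl f(x, y) = x^2 + y^2, minimum at (0, 0):
final = adam(lambda x, y: (2 * x, 2 * y), (3.0, -2.0))
```

Dividing the averaged gradient by the root of the averaged squared gradient makes the step size roughly scale-free, which is why ADAM tolerates much larger learning rates than vanilla SGD.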
OGR has a step-size parameter similar to the learning rate, for which a value of 1 would mean jumping to the modeled minimum of the parabola. It also has two adaptation rates: one for the averaged gradient used in the Newton step, and one for the Hessian estimator. These two rates could be assumed equal, simplifying the method and reducing computational cost; however, for generality, they are kept separate here.
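The diagonal (cdOGR-style) idea can be illustrated roughly as follows. This is my own simplified sketch, not the algorithm of [3]: per coordinate, the curvature of a modeled parabola is estimated by online linear regression of the gradient against the position (via exponential moving averages), and a damped, clipped Newton-like step is taken.

```python
def ogr_like_step(grad, x0, step=0.5, beta=0.9, steps=100,
                  min_curv=0.1, max_move=1.0):
    """Per coordinate: regress gradient on position using exponential
    moving averages to estimate curvature, then move by
    step * gradient / curvature, clipped to avoid huge jumps."""
    x = list(x0)
    mx = [0.0, 0.0]   # average position
    mg = [0.0, 0.0]   # average gradient
    mxx = [0.0, 0.0]  # average squared position
    mxg = [0.0, 0.0]  # average position * gradient
    for _ in range(steps):
        g = grad(*x)
        for i in range(2):
            mx[i] = beta * mx[i] + (1 - beta) * x[i]
            mg[i] = beta * mg[i] + (1 - beta) * g[i]
            mxx[i] = beta * mxx[i] + (1 - beta) * x[i] * x[i]
            mxg[i] = beta * mxg[i] + (1 - beta) * x[i] * g[i]
            var = mxx[i] - mx[i] * mx[i]
            cov = mxg[i] - mx[i] * mg[i]
            curv = cov / var if var > 1e-12 else min_curv
            curv = max(curv, min_curv)  # guard: flat/negative curvature
            move = step * g[i] / curv
            move = max(-max_move, min(max_move, move))  # prevent huge steps
            x[i] -= move
    return x

# Quadratic bowl f(x, y) = x^2 + y^2; the regression recovers curvature 2:
final = ogr_like_step(lambda x, y: (2 * x, 2 * y), (3.0, -2.0))
```

On an exact parabola the regression recovers the true second derivative, so with `step` below 1 the iterates contract geometrically toward the minimum; the `min_curv` and `max_move` guards correspond to the step-limiting mentioned above.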
References
[1] Wikipedia. "Test Functions for Optimization." (Mar 7, 2023) en.wikipedia.org/wiki/Test_functions_for_optimization.
[2] S. Ruder. "An Overview of Gradient Descent Optimization Algorithms." (Mar 7, 2023) www.ruder.io/optimizing-gradient-descent.
[3] J. Duda. "Improving SGD Convergence by Online Linear Regression of Gradients in Multiple Statistically Relevant Directions." arxiv.org/abs/1901.11457.
[4] JarekDuda. "SGD-OGR-Hessian-estimator." (Mar 7, 2023) github.com/JarekDuda/SGD-OGR-Hessian-estimator.
