Kernel Density Estimation
Histograms are a useful but limited way to estimate or visualize the true, underlying density of some observed data with an unknown distribution. Histograms are essentially discontinuous step functions. So, if you believe that observed data is generated by a continuous density—or even a differentiable density—then another histogram-like estimation procedure might be preferable.
A kernel histogram is a generalization of the usual histogram. It associates to each data point a function (called a kernel function). The kernel histogram is the (properly renormalized) sum of these functions. Kernel functions typically depend on a parameter, usually called the bandwidth, that significantly affects the roughness or smoothness of the kernel histogram that is ultimately generated. (Somewhat confusingly, kernel functions are themselves density functions.)
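As a concrete sketch of the definition above, the kernel histogram with a Gaussian kernel can be written as an average of bandwidth-scaled kernel functions centered at the data points. This Python fragment is purely illustrative (the Demonstration itself is written in the Wolfram Language), and the names `gaussian_kernel` and `kde` are invented for this example:

```python
import math

def gaussian_kernel(u):
    # Standard normal density: note that the kernel function is
    # itself a density, as remarked in the text.
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    # Kernel histogram at the point x: a sum of kernels, one per
    # data point, each rescaled by the bandwidth h.  The 1/(n*h)
    # factor is the renormalization that makes the result a density.
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

data = [0.2, 0.5, 0.9, 1.4]
print(kde(0.5, data, 0.3))
```

Because of the 1/(n*h) renormalization, the resulting function integrates to one, so the kernel histogram is itself a legitimate probability density.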
Choose a target distribution from which to generate random data, as well as a type of kernel function. For example, the Epanechnikov kernel has certain asymptotic properties that make it a highly desirable kernel, though you can obtain a kernel histogram very much like the usual histogram by choosing a uniform kernel. Add additional random realizations from the target distribution and watch the kernel histogram converge to the true, underlying density.
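For reference, here are minimal Python versions of the two kernels named above. Both are themselves probability densities supported on [-1, 1]; the function names are invented for this sketch:

```python
def epanechnikov(u):
    # Parabolic kernel on [-1, 1]; asymptotically efficient in the
    # mean-integrated-squared-error sense.
    return 0.75 * (1.0 - u * u) if abs(u) <= 1 else 0.0

def uniform(u):
    # Box kernel on [-1, 1]; the resulting kernel histogram looks
    # much like an ordinary histogram with smoothed bin boundaries.
    return 0.5 if abs(u) <= 1 else 0.0

print(epanechnikov(0.0), uniform(0.0))
```

Swapping either function into a kernel-histogram routine changes only the shape of the bump placed at each data point, not the renormalization.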
Contributed by: Jeff Hamrick (March 2011)
Open content licensed under CC BY-NC-SA
The author's interest in kernel estimation techniques stems from a recent paper in which the author used similar techniques to nonparametrically estimate a coefficient function in a stochastic differential equation driven by a standard Brownian motion. Kernel estimation techniques are also used, for example, to estimate the unknown functions in a nonlinear regression model whose errors form an independent, identically distributed sequence with mean zero. There are numerous applications of kernel estimation techniques, including the density estimation technique featured in this Demonstration. For more information, see the Wikipedia entry on kernel density estimation.
Several lessons about kernel histograms can be learned quickly from this Demonstration. First, notice that when the number of data points is quite small (before you start adding lots of additional points), you can see the individual kernel functions quite clearly. At that stage, it is not easy to see how the kernel functions are "estimating" the true underlying density.
Continue to add new data and notice that making the bandwidth small reveals a great deal about the random data that has been generated according to the law of the selected target distribution. However, making the bandwidth small also makes the resulting kernel histogram rather unbelievable. Making the bandwidth very large smooths out the wrinkles in the kernel histogram, but may result in a kernel histogram that does not retain any unusual or interesting features of the data.
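The bandwidth's effect on roughness can be quantified in a crude way: a small bandwidth produces a wiggly estimate, a large one a smooth estimate. This sketch (illustrative Python, with an invented `roughness` helper that measures the total variation of the estimate along a grid) is one way to see that:

```python
import math, random

def kde(x, data, h):
    # Gaussian kernel histogram, as described in the text.
    k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(k((x - xi) / h) for xi in data) / (len(data) * h)

def roughness(data, h, grid):
    # Total variation of the estimate along the grid: a crude
    # numerical proxy for how wiggly the kernel histogram is.
    vals = [kde(x, data, h) for x in grid]
    return sum(abs(b - a) for a, b in zip(vals, vals[1:]))

random.seed(0)
data = [random.gauss(0, 1) for _ in range(100)]
grid = [-4 + 0.02 * i for i in range(401)]

# A tiny bandwidth yields a much rougher (higher-variation) curve
# than a large one, over the same data.
print(roughness(data, 0.05, grid), roughness(data, 1.0, grid))
```

Neither extreme is trustworthy: the rough curve chases sampling noise, while the oversmoothed curve can erase genuine features, exactly as described above.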
Next, notice that while the kernel histogram is converging to the true, underlying density, the rate of convergence does not seem fast. Indeed, it has been shown that if the true, underlying distribution of the data is sufficiently smooth, the rate of convergence in an L2 sense is n^(-2/5). In other words, kernel histograms converge at a rate that is slower than the analogous n^(-1/2) rate appearing in the central limit theorem (see Kolmogorov's addendum to the Glivenko-Cantelli theorem for additional information).
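The slow convergence can be glimpsed empirically. The sketch below (illustrative Python, not part of the Demonstration) estimates the integrated squared error of a Gaussian kernel histogram against a known standard normal target for increasing sample sizes, with the bandwidth shrunk as n^(-1/5), a standard choice for unit-variance data; the error shrinks, but not dramatically:

```python
import math, random

def kde(x, data, h):
    # Gaussian kernel histogram.
    k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(k((x - xi) / h) for xi in data) / (len(data) * h)

def ise(data, h, grid, step):
    # Integrated squared error against the true standard normal
    # density, approximated by a Riemann sum over the grid.
    phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
    return sum((kde(x, data, h) - phi(x)) ** 2 for x in grid) * step

random.seed(0)
grid = [-4 + 0.05 * i for i in range(161)]
errors = []
for n in (50, 500, 5000):
    data = [random.gauss(0, 1) for _ in range(n)]
    h = 1.06 * n ** (-0.2)  # bandwidth shrinking at the n^(-1/5) rate
    errors.append(ise(data, h, grid, 0.05))
print(errors)
```

This is only a single-sample illustration, not a proof; averaging over many replications would show the n^(-4/5) decay of the mean integrated squared error more cleanly.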
The kernel histograms that we generate in this Demonstration have not been adjusted for any underlying assumptions regarding the support of the target distribution. Consider the use of a kernel histogram to estimate an exponential density. An exponential random variable assumes negative values with zero probability, but virtually all kernel histograms used to estimate an exponential density are strictly positive to the left of zero.
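The boundary problem described above is easy to reproduce. In this illustrative Python sketch, a Gaussian kernel histogram built from exponential data assigns strictly positive density at a negative point, even though the true density is zero there:

```python
import math, random

def kde(x, data, h):
    # Gaussian kernel histogram.
    k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(k((x - xi) / h) for xi in data) / (len(data) * h)

random.seed(1)
# Exponential data: every observation is nonnegative, and the true
# density vanishes for x < 0.
data = [random.expovariate(1.0) for _ in range(500)]

# The kernel histogram nevertheless leaks probability mass across
# the boundary at zero.
print(kde(-0.1, data, 0.3))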
There are techniques, however, to manage this undesirable property. There are also techniques to "optimally" choose the bandwidth, even without knowing the underlying distribution of the data. For more information, see Fan and Yao (2003) or Bradley and Taqqu (2003).
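One of the simplest bandwidth-selection techniques (a common rule of thumb, not necessarily the method the cited references recommend) is Silverman's normal-reference rule, h = 1.06 * sigma * n^(-1/5), which is derived by pretending the unknown density is Gaussian. A small illustrative Python sketch, with an invented helper name:

```python
import math, random

def silverman_bandwidth(data):
    # Silverman's rule of thumb: h = 1.06 * sigma * n^(-1/5),
    # where sigma is the sample standard deviation.  Derived under
    # the assumption that the true density is Gaussian.
    n = len(data)
    mean = sum(data) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return 1.06 * sigma * n ** (-0.2)

random.seed(2)
data = [random.gauss(0, 1) for _ in range(200)]
print(silverman_bandwidth(data))
```

More sophisticated data-driven selectors (for example, cross-validation) exist and are discussed in the references above.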