Estimating the Local Mean Function
The linear regression model relies on a number of crucial assumptions, the most important of which is a linear relationship between the covariate $X$ and the dependent variable $Y$. One way of generalizing the standard linear regression model is to extend it to the nonlinear form $Y = \mu(X) + \sigma(X)\,\varepsilon$, where $\mu$ and $\sigma$ are real-valued functions and $\varepsilon$ is an independent, identically distributed sequence of standard normal random variables. We study five different sets of data (three "real" and two simulated) to illustrate how kernel-weighted least-squares regression can be used to estimate the local mean function $\mu$.
There is an extensive literature on nonlinear regression and nonparametric estimation techniques. We refer the reader to the notable book by Fan and Yao and to the extensive work by Bjerve and Doksum. Roughly speaking, the procedure to estimate $\mu$ can be implemented in the following fashion. First, take a set of evenly spaced design points over an interior interval of the empirical support of the covariate $X$. Then, at each design point, solve a kernel-weighted least-squares problem to locally fit a polynomial of order $p$. (In this Demonstration, the local fit is parabolic, that is, $p = 2$.)
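The first step, choosing the design points, can be sketched as follows. This is a minimal illustration rather than the Demonstration's own code: the function name `design_points`, the number of points, and the use of the 5% and 95% empirical quantiles to define the "interior interval" are all assumptions made for the example.

```python
import numpy as np

def design_points(x, n_points=25, trim=0.05):
    """Evenly spaced points over an interior interval of the empirical support of x.

    The interval runs between the `trim` and `1 - trim` empirical quantiles;
    this trimming rule is an assumption, not something fixed by the text.
    """
    lo, hi = np.quantile(x, [trim, 1.0 - trim])
    return np.linspace(lo, hi, n_points)

# Hypothetical covariate sample, used only to exercise the function.
rng = np.random.default_rng(0)
x = rng.standard_normal(600)
grid = design_points(x)
```

Staying away from the extremes of the sample keeps every design point surrounded by data on both sides, which is what the local fits need.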
By "kernel-weighted", we mean that the data are weighted according to the Epanechnikov kernel $K(u) = \tfrac{3}{4}(1 - u^2)$ for $|u| \le 1$ (and $K(u) = 0$ otherwise), where $u = (x - x_0)/h$, $x_0$ is a design point, and $h$ is the "kernel bandwidth". The variable $h$ effectively controls the amount of nearby data that are permitted to influence the estimate of $\mu$ (and its derivatives) locally. There are a variety of techniques and heuristics available for choosing $h$. You can vary the size of the bandwidth: smaller bandwidths tend to reveal too many of the local features of the data, while larger bandwidths oversmooth the data.
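A kernel-weighted local fit at a single design point might look like the sketch below. It assumes the standard form of the Epanechnikov kernel and a local parabola written in powers of the distance to the design point; the function names `epanechnikov` and `local_quadratic_fit` are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_quadratic_fit(x, y, x0, h):
    """Kernel-weighted least-squares fit of a parabola centred at x0.

    Returns estimates of the local mean and its first two derivatives at x0.
    """
    d = x - x0
    w = epanechnikov(d / h)
    keep = w > 0  # only observations within one bandwidth of x0 contribute
    # Design matrix for b0 + b1*d + b2*d^2.
    X = np.column_stack([np.ones(keep.sum()), d[keep], d[keep] ** 2])
    # Weighted least squares: scale rows by the square root of the weights.
    sw = np.sqrt(w[keep])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y[keep] * sw, rcond=None)
    # b0 is the value, b1 the first derivative, 2*b2 the second derivative.
    return beta[0], beta[1], 2.0 * beta[2]

# Noise-free check: for y = 1 + 2x + 3x^2 the exact values at x0 = 0.5
# are 2.75 (value), 5 (first derivative), and 6 (second derivative).
x = np.linspace(-2.0, 2.0, 201)
y = 1.0 + 2.0 * x + 3.0 * x**2
mu0, mu1, mu2 = local_quadratic_fit(x, y, x0=0.5, h=1.0)
```

Shrinking `h` leaves fewer observations with positive weight, so the fit follows local wiggles more closely; enlarging `h` averages over more data and smooths them away.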
By solving a kernel-weighted least-squares regression at each design point, we obtain an estimate of the value of $\mu$ and of its first two derivatives at each design point. We then have all the information we need to fit a spline.
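The text does not say which spline is fitted. One possibility, shown here purely as an assumption, is a piecewise quintic Hermite interpolant, which matches the estimated value and first two derivatives at every design point. The helper names `local_estimates` and `quintic_hermite` are hypothetical, and the local fits use `np.polyfit` for brevity.

```python
import numpy as np

def local_estimates(x, y, grid, h):
    """Local quadratic fits at each grid point; columns are value, 1st, 2nd derivative."""
    out = np.empty((len(grid), 3))
    for i, x0 in enumerate(grid):
        d = x - x0
        k = np.clip(0.75 * (1.0 - (d / h) ** 2), 0.0, None)  # Epanechnikov weights
        keep = k > 0
        # polyfit weights multiply the unsquared residuals, hence the square root.
        b2, b1, b0 = np.polyfit(d[keep], y[keep], 2, w=np.sqrt(k[keep]))
        out[i] = (b0, b1, 2.0 * b2)
    return out

def quintic_hermite(grid, est, x_new):
    """Evaluate the piecewise quintic matching value and two derivatives at each knot."""
    x_new = np.asarray(x_new, dtype=float)
    # Index of the interval [grid[j], grid[j+1]] containing each evaluation point.
    j = np.clip(np.searchsorted(grid, x_new, side="right") - 1, 0, len(grid) - 2)
    L = grid[j + 1] - grid[j]
    t = (x_new - grid[j]) / L
    p0, v0, a0 = est[j].T
    p1, v1, a1 = est[j + 1].T
    t2, t3, t4, t5 = t**2, t**3, t**4, t**5
    return ((1 - 10 * t3 + 15 * t4 - 6 * t5) * p0
            + (t - 6 * t3 + 8 * t4 - 3 * t5) * L * v0
            + (0.5 * t2 - 1.5 * t3 + 1.5 * t4 - 0.5 * t5) * L**2 * a0
            + (0.5 * t3 - t4 + 0.5 * t5) * L**2 * a1
            + (-4 * t3 + 7 * t4 - 3 * t5) * L * v1
            + (10 * t3 - 15 * t4 + 6 * t5) * p1)

# Hypothetical noisy data, used only to exercise the two functions.
rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, 600)
y = np.sin(x) + 0.3 * rng.standard_normal(600)
grid = np.linspace(-2.5, 2.5, 21)
est = local_estimates(x, y, grid, h=0.8)
curve = quintic_hermite(grid, est, np.linspace(-2.5, 2.5, 200))
```

Because the interpolant uses the derivative estimates as well as the values, it passes through each local estimate of the mean with the locally estimated slope and curvature.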
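The "cubic" set of data is a simulated set formed by generating 600 realizations of $X$ with a standard normal distribution and then simulating the corresponding values of $Y$ from these realizations and noise terms $\varepsilon_i$, where the $\varepsilon_i$ are independently simulated standard normal random variables.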
The "sine" set of data is another simulated set, formed by first generating 600 realizations of $X$ with a uniform distribution over a fixed interval. We then obtain simulations of $Y$ from these realizations and noise terms $\varepsilon_i$, where the $\varepsilon_i$ are independently simulated standard normal random variables.
The "baseball" data consist of performance data for all regular major league baseball players during the 1999 baseball season. For each player, we compute the overall proportion of hits to at-bats and the proportion of hits to at-bats when there is a teammate in scoring position. The two proportions are positively correlated, as one would expect, but the nonlinear regression model offers a potentially more useful fit than the usual linear regression model. These data were taken from John Rasp's website.
The "body fat" data were also taken from John Rasp's website. The data were taken from a sample of 252 men. The covariate is the weight (in pounds) of the male subject, and the dependent variable is the body fat percentage (obtained through an underwater weighing procedure).
The "stock" data consist of U.S. equity returns ( variable) and French equity returns ( variable) for 1000 trading days in the late 1990s.