Nonparametric Regression and Kernel Smoothing: Confidence Regions for the L2-Optimal Curve Estimate


This Demonstration considers one of the simplest nonparametric-regression problems: how to "optimally denoise" a noisy signal, assuming it is a smooth curve plus a realization of i.i.d. centered Gaussian variables of known variance σ². The setting is the same as in the Demonstration "Nonparametric Curve Estimation by Kernel Smoothers: Efficiency of Unbiased Risk Estimate and GCV Selectors". Recall that applying a classical kernel-smoothing method to a noisy signal actually produces a family of candidate estimates of the underlying curve, each estimate being associated with (or tuned by) a bandwidth value, denoted h. Ideally one would like to pick "the optimal h" given the data. A classical approach is to attempt (as in the previously mentioned Demonstration) to minimize over h the average of the squared errors, ASE(h), which is a discrete version of the L2 distance between the underlying (true) curve and the curve estimate obtained with bandwidth h (see Details). The unknown true minimizer (resp. the associated curve estimate) is called the ASE-optimal bandwidth (resp. the ASE-optimal curve estimate).
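The ASE criterion described above can be sketched as follows. This is an illustrative Python/NumPy translation, not the Demonstration's Mathematica code; the true curve, the noise level, and the Gaussian-kernel Nadaraya–Watson smoother are arbitrary choices made here for concreteness.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 256
t = np.linspace(0.0, 1.0, n)
f = np.sin(2 * np.pi * t)              # "true" smooth curve (assumed here)
sigma = 0.3                            # known noise standard deviation
y = f + sigma * rng.standard_normal(n)

def smoother_matrix(h):
    """Gaussian-kernel Nadaraya-Watson smoother matrix A_h (n x n)."""
    d = t[:, None] - t[None, :]
    w = np.exp(-0.5 * (d / h) ** 2)
    return w / w.sum(axis=1, keepdims=True)   # normalize each row

def ase(h):
    """Average squared error between the estimate A_h y and the true curve."""
    fhat = smoother_matrix(h) @ y
    return np.mean((fhat - f) ** 2)

grid = np.linspace(0.005, 0.1, 60)     # candidate bandwidths
errors = np.array([ase(h) for h in grid])
h_ase = grid[int(errors.argmin())]     # the ASE-optimal bandwidth
```

Of course, `ase(h)` uses the true curve `f`, which is unknown in practice; this is precisely why the ASE-optimal bandwidth must be inferred, as the Demonstration does.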


This Demonstration implements a simulation-based method, proposed in [1], for constructing approximate confidence intervals for the ASE-optimal bandwidth, and shows why Mathematica's built-in Manipulate function is especially appropriate for the inferences that such intervals make possible (e.g. answering the question, "Is the ASE-optimal curve estimate rather accurately located with a confidence of, say, 90%?").

The data processing in this program assumes that σ² is known, so Mallows' C_L criterion (an unbiased estimate of the risk) is then available in place of GCV. This restriction is for simplicity: the augmented randomization would otherwise require an estimate of σ², and a number of quite different methods are available for producing such an estimate.

For a given percentile number α, chosen between 1 and 50, such a construction consists of: 1. simulating a large number (here 1000) of randomized-C_L criteria (to be more precise, each criterion uses a randomized-trace function with "augmented randomness" as defined in [1], and is referred to as an ARCL criterion); 2. minimizing each of these 1000 criteria; and 3. taking the α-th and (100-α)th percentiles of the population of the 1000 minimizers.
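Steps 1–3 can be sketched schematically as below. This is illustrative Python/NumPy, not the Demonstration's Mathematica code, and it does not reproduce the exact "augmented randomness" of [1]; it only illustrates the randomized-trace idea, where tr(A_h) in Mallows' C_L is replaced by w·(A_h w) for a random sign vector w, whose expectation is tr(A_h).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128
t = np.linspace(0.0, 1.0, n)
sigma = 0.3                                  # known noise level (assumed)
y = np.sin(2 * np.pi * t) + sigma * rng.standard_normal(n)
grid = np.linspace(0.01, 0.1, 40)            # candidate bandwidths

def smoother_matrix(h):
    d = t[:, None] - t[None, :]
    w = np.exp(-0.5 * (d / h) ** 2)
    return w / w.sum(axis=1, keepdims=True)

# Precomputation over the bandwidth grid (the Demonstration speeds this
# up further with a Fourier-domain implementation):
mats = [smoother_matrix(h) for h in grid]
rss = np.array([np.sum((y - A @ y) ** 2) for A in mats]) / n

# Steps 1-2: simulate 1000 randomized criteria and minimize each over h.
n_sim = 1000
minimizers = np.empty(n_sim)
for k in range(n_sim):
    w = rng.choice([-1.0, 1.0], size=n)               # random sign vector
    trace_est = np.array([w @ (A @ w) for A in mats]) # randomized tr(A_h)
    crit = rss + 2.0 * sigma**2 * trace_est / n       # randomized C_L(h)
    minimizers[k] = grid[int(crit.argmin())]

# Step 3: equal-tail percentiles of the 1000 minimizers (here alpha = 5).
alpha = 5
h_lo, h_hi = np.percentile(minimizers, [alpha, 100 - alpha])
```

The interval [h_lo, h_hi] is the (schematic) equal-tail confidence interval for the ASE-optimal bandwidth.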

It is shown in [1] that the percentiles of the population that would be obtained in theory at step 2 if the number of simulations (here 1000) were increased to infinity yield asymptotically (i.e. as the size of the dataset tends to infinity) conservative statements.

Precisely, assuming that the asymptotic regime is attained, the α-th (respectively (100-α)th) percentile will be less (resp. greater) than the ASE-optimal bandwidth with probability at least 1-α/100. This last probability is thus the so-called nominal coverage probability of each bound. Notice that we only consider the case of an identical nominal coverage for the lower bound and the upper bound (an equal-tail interval). The (asymptotically guaranteed) coverage of the two-sided interval is thus at least 1-2α/100.

In this Demonstration, if α is chosen between 50 and 99, then 100-α plays the role of α in the previous paragraph.

The true curve can be chosen among six possibilities, and the size of every simulated dataset is fixed at 1024 (although this can be modified in the program). You can vary the "noise level" from relatively small (with respect to the "magnitude" of the true curve, here about 1) to rather large. You can also select the checkbox "include data in the plot".

You can fix α by choosing it among typical percentile values, say 95 (the nominal one-sided coverage is thus fixed at 0.95), and clicking the tab labeled "ARCL percentile curve estimate …". You can then see the first view of the results (detailed below) obtained when analyzing the dataset generated with the seed set to 1.

Now, clicking the "Play" button for the slider "seed for data generation" repeats, for a succession of simulated datasets, the following three tasks:

• first, the confidence interval for the ASE-optimal bandwidth is constructed (steps 1, 2, and 3 above run rather quickly in this program thanks to a Fourier-domain implementation and some precomputation);

• next, the two "bounding kernels" at the selected level, corresponding to the lower and upper limits of the confidence interval, are plotted in black in the additional panel at the top right of view 1 (also plotted in the main panel, in black, is the ARCL percentile curve estimate, that is, the one whose bandwidth equals the percentile of the 1000 bandwidths taken in step 3 above);

• and the two empirical coverages are updated by simply checking for the current dataset whether or not the ASE-optimal kernel (plotted in dashed green) is really bounded by the above bounding kernels.
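The coverage bookkeeping in the third task can be sketched as follows (illustrative Python, not the Demonstration's code; the function name and the tuple-based state are conveniences introduced here): for each seed, check whether the ASE-optimal bandwidth really lies between the two constructed bounds, and maintain running empirical coverages.

```python
def update_coverage(state, h_ase, h_lo, h_hi):
    """state = (n_trials, n_lower_ok, n_upper_ok); returns the updated
    state plus the two empirical coverages in percent."""
    n_trials, n_lo, n_hi = state
    n_trials += 1
    n_lo += h_lo <= h_ase          # lower bound really below the target
    n_hi += h_ase <= h_hi          # upper bound really above the target
    state = (n_trials, n_lo, n_hi)
    return state, (100.0 * n_lo / n_trials, 100.0 * n_hi / n_trials)

state = (0, 0, 0)
state, cov = update_coverage(state, h_ase=0.04, h_lo=0.03, h_hi=0.06)
state, cov = update_coverage(state, h_ase=0.07, h_lo=0.03, h_hi=0.06)
# after these two seeds: the lower bound held twice, the upper bound once
```

Note that, as explained later in the text, the target h_ase is itself a new "parameter" for each seed, unlike the fixed parameter of a classical confidence-interval simulation.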

After a few hundred datasets differing only by their seed have been processed automatically, one can first see that the constructed confidence regions are often very good (see the Details below): indeed, we have found that in the few cases where such a region is not conservative, the empirical coverage converges (as more and more seeds are processed) to a value lower than the nominal probability, but only by a small amount.

In addition, the animation so obtained in the first view by clicking the "Play" button also allows an assessment of the variability of the two bounding kernels.

Next, and perhaps more importantly, by clicking the "Pause" button, selecting the second view, and moving the slider "trial bandwidth", you can observe that a "large" difference between the lower and upper bounds for the ASE-optimal bandwidth (with a coverage probability fixed at, say, 0.90) does not necessarily imply a large difference between the corresponding curves that delimit the confidence region (curves in orange) for the ASE-optimal curve estimate (see Details).

The way the two coverages are updated is coded similarly to the Demonstration "How Do Confidence Intervals Work?", except that the statement (namely, that the constructed "bounding" bandwidth is a true bound) concerns an unknown "parameter" (the ASE-optimal bandwidth) that varies here from seed to seed, whereas it is a constant (like the usual "mean value") in that Demonstration.


Contributed by: Didier A. Girard (March 2013)
(CNRS-LJK and University Joseph Fourier, Grenoble)
Open content licensed under CC BY-NC-SA


Snapshots


Details

See the Demonstration "Nonparametric Curve Estimation by Kernel Smoothers: Efficiency of Unbiased Risk Estimate and GCV Selectors" for the definitions of the setting and of the L2 distance when a noisy signal is given. In that Demonstration, the now classical cross-validation (identical to GCV here) choice in the family of candidate curve estimates is computed and compared with the curve estimate using the ASE-optimal bandwidth. The bandwidth that minimizes the unbiased risk estimate, Mallows' C_L (also denoted UBR), is also computed, and it proves to be very close to the GCV choice in many cases.
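For reference, the two selectors mentioned above can be written down for any linear smoother with matrix A_h. The sketch below (illustrative Python/NumPy; the data-generating choices are arbitrary) shows the standard forms: GCV needs no knowledge of the noise variance, whereas Mallows' C_L uses the known σ².

```python
import numpy as np

rng = np.random.default_rng(2)
n = 128
t = np.linspace(0.0, 1.0, n)
sigma2 = 0.09                                 # known noise variance
y = np.sin(2 * np.pi * t) + np.sqrt(sigma2) * rng.standard_normal(n)

def smoother_matrix(h):
    d = t[:, None] - t[None, :]
    w = np.exp(-0.5 * (d / h) ** 2)
    return w / w.sum(axis=1, keepdims=True)

def gcv(Ah):
    """Generalized cross-validation score of the smoother A_h."""
    rss = np.sum((y - Ah @ y) ** 2)
    return (rss / n) / (1.0 - np.trace(Ah) / n) ** 2

def mallows_cl(Ah):
    """Mallows' C_L (UBR): an unbiased estimate of the risk."""
    rss = np.sum((y - Ah @ y) ** 2)
    return rss / n + 2.0 * sigma2 * np.trace(Ah) / n - sigma2

grid = np.linspace(0.01, 0.1, 40)
mats = [smoother_matrix(h) for h in grid]
h_gcv = grid[int(np.argmin([gcv(A) for A in mats]))]
h_ubr = grid[int(np.argmin([mallows_cl(A) for A in mats]))]
```

As the text notes, the two selected bandwidths are typically very close in settings like this one.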

Snapshot 1: Selecting one of the proposed true functions underlying the data, and using the α-th and (100-α)th percentiles as lower and upper bounds (the two tab choices differ only in whether the plotted ARCL percentile curve estimate is the one associated with the lower bound or with the upper bound), we can find after a few hundred simulations that the two empirical coverages, for the lower and upper bounds (not shown here), converge respectively to 97.2 and 94.8, which means that the coverage probability is very close to the nominal coverage probability 0.95 for the upper bound and slightly conservative for the lower bound.

Snapshot 2: Selecting another of the proposed true functions, with the same percentile choices, we could find after enough simulations that the two empirical values converge to values close to 0.95 (even more so than in the above case); furthermore, during this simulation, we can see in the top inset of this first view that the two kernels associated with the lower bound and the upper bound are quite close for any seed.

Snapshot 3: As in Snapshot 2, after having stopped the simulation by clicking the pause button, the curve estimate associated with the other percentile is now plotted instead of the one previously plotted in black: an important observation (perhaps more important than the closeness of the kernels mentioned above) is that this ARCL percentile curve estimate is almost indistinguishable from the ASE-optimal curve estimate, as was also the case for the first ARCL percentile curve estimate. This can be observed for almost every seed (the seed is 1 in this snapshot), albeit with sometimes small local fluctuations. This means that for such a setting, one can get a rather accurate plot of the unknown ASE-optimal curve estimate with a confidence level of 0.90.

Snapshot 4: Selecting the "bell + peak" function as the true function, choosing a large noise level, and keeping the same percentile choices, we could find after enough simulations that the two empirical values converge respectively to 0.939 and 0.940, so the coverage error is small; however, for almost any seed, we can see that the two ARCL percentile curve estimates (lower and upper) are quite different.

Snapshot 5: As in Snapshot 4, having stopped the previous simulations, you can observe that, even if the peak seems to be present in this second view when using a small bandwidth (shown here for the seed set to 1), the upper bound at level 0.95 (whose computed value is displayed in the top inset of the first view) yields a curve estimate in which the peak is not present.

(This curve estimate is obtained in the second view by moving the slider "trial bandwidth" toward the computed upper bound, or by noticing that the plotted estimate automatically takes the dashed red appearance as soon as the bandwidth exceeds 0.428.)

With such very noisy data, this methodology does not permit us to guarantee, at a confidence level of 0.95, the existence of this peak; nevertheless, the global "bell-like" shape is well guaranteed. However, one can check that with a lower noise level (e.g. as in the thumbnail), the confidence regions at level 0.95 that can be displayed in this second view do consist of curve estimates in which the peak is almost always present.

Reference

[1] D. A. Girard, "Estimating the Accuracy of (Local) Cross-Validation via Randomised GCV Choices in Kernel or Smoothing Spline Regression," Journal of Nonparametric Statistics, 22(1), 2010, pp. 41–64. doi:10.1080/10485250903095820.


