The model-free, data-driven technique discussed here is so basic that it can easily be implemented in Excel, and we actually provide an Excel implementation. It is surprising that this technique does not pre-date standard linear regression, and that it is rarely if ever used by statisticians and data scientists. It is related to kriging and nearest-neighbor interpolation, and was apparently first mentioned in 1965 by Harvard scientists working on GIS (geographic information systems). It was referred to back then as Shepard's method or inverse distance weighting, and used for multivariate interpolation on non-regular grids (see here and here). We call this technique simple regression.
In this article, we show how simple regression can be generalized and used in regression problems, especially when standard regression fails due to multi-collinearity or other issues. It can safely be used by non-experts without risking misinterpretation of the results or over-fitting. We also show how to build confidence intervals for predicted values, compare it to linear regression on test data sets, and apply it to a non-linear context (regression on a circle) where standard regression fails. Not only does it work for prediction inside the domain (equivalent to interpolation), but also, to a lesser extent and with extra care, outside the domain (equivalent to extrapolation). No matrix inversion or gradient descent is needed in the computations, making it a faster alternative to linear or logistic regression.
1. Simple regression explained
For ease of presentation, we only discuss the two-dimensional case. Generalization to any dimension is straightforward. Let us assume that the data set (also called training set) consists of n points or locations (X1, Y1), ..., (Xn, Yn) together with the responses (also called dependent values) Z1, ..., Zn attached to each observation. Then the predicted value Z at an arbitrary location (X, Y) is computed as follows:

\[ Z = \frac{\frac{1}{n}\sum_{k=1}^{n} w(X, Y; X_k, Y_k)\, Z_k}{\frac{1}{n}\sum_{k=1}^{n} w(X, Y; X_k, Y_k)} \]

that is, as the quotient of a weighted mean of the responses by the mean of the weights.
Throughout this article, we used the standard inverse distance weight

\[ w(X, Y; X_k, Y_k) = \frac{1}{\left[ (X - X_k)^2 + (Y - Y_k)^2 \right]^{\beta/2}} \]
with β = 5. The parameter β controls the smoothness and is actually a hyperparameter. It should be set to at least twice the dimension of the problem. A large value of β decreases the influence of far-away points in the predictions. In a Bayesian framework, a prior could be attached to β. Also note that if (X, Y) is one of the n training set points, say (X, Y) = (Xj, Yj) for some j, then Z must be set to Zj. In short, the predicted value is exact for points belonging to the training set. If (X, Y) is very close to, say, (Xj, Yj) and further away from the other training set points, then the computed Z is very close to Zj. It is assumed here that there are no duplicate locations in the training set; otherwise, the formula needs adjustments.
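Below is a minimal Python sketch of this predictor. It is not the article's Excel implementation: the use of NumPy, the function name, and the handling of exact matches are my choices.

```python
import numpy as np

def simple_regression_predict(X_train, Z_train, x_new, beta=5.0):
    """Inverse-distance (Shepard-type) prediction at location x_new.

    X_train : (n, d) array of training locations (X1, Y1), ..., (Xn, Yn)
    Z_train : (n,) array of responses Z1, ..., Zn
    x_new   : (d,) array, the location (X, Y) where we want a prediction
    beta    : smoothness hyperparameter (at least twice the dimension d)
    """
    X_train = np.asarray(X_train, dtype=float)
    Z_train = np.asarray(Z_train, dtype=float)
    d2 = np.sum((X_train - np.asarray(x_new, dtype=float)) ** 2, axis=1)
    hit = np.flatnonzero(d2 == 0.0)
    if hit.size:                         # x_new is a training location:
        return float(Z_train[hit[0]])    # the prediction is exact
    w = d2 ** (-beta / 2.0)              # weight = 1 / distance^beta
    return float(np.sum(w * Z_train) / np.sum(w))

# Tiny usage example: the prediction at (0.2, 0.1) is pulled toward Z1 = 0,
# the response attached to the nearest training point (0, 0).
print(simple_regression_predict([[0, 0], [1, 0], [0, 1]], [0, 1, 1], [0.2, 0.1]))
```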
2. Case studies and Excel spreadsheet with computations
We did some simulations to compare the performance of simple regression versus linear regression. In the first example, the training set consists of n = 100 data points generated as follows. The locations are random points (Xk, Yk) in the two-dimensional unit square [0, 1] x [0, 1]. The response was set to Zk = SQRT[(Xk)^2 + (Yk)^2]. The control set consists of another n = 100 points, also randomly distributed on the same unit square. The predicted values were computed on the control set, and the goal is to check how well they approximate the theoretical (true) value SQRT(X^2 + Y^2). Both the simple and the linear regression perform well, though the R-squared is a little better for the simple regression, for most training and control sets of this type. The picture below shows the quality of the fit. A perfect fit would correspond to points lying exactly on the diagonal rather than forming a cloud, with 0.9886 and 0.0089 (the slope and intercept of the red line) replaced respectively by 1 and 0. Note that the R-squared of 0.9897 is very close to 1.
Figure 1: Data set doing well with both simple and linear regression
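For readers who prefer code to spreadsheets, here is a hedged sketch of this first experiment. The random seed, helper names, and the NumPy-based linear regression are my choices, so the slope, intercept, and R-squared will differ slightly from Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Training and control locations, uniform on the unit square [0, 1] x [0, 1]
X_train = rng.random((n, 2))
X_ctrl = rng.random((n, 2))
Z_train = np.sqrt((X_train ** 2).sum(axis=1))   # response Zk = SQRT(Xk^2 + Yk^2)
Z_true = np.sqrt((X_ctrl ** 2).sum(axis=1))     # true values on the control set

def predict(Xt, Zt, x, beta=5.0):
    # Inverse-distance predictor from section 1 (x assumed distinct from training points)
    w = np.sum((Xt - x) ** 2, axis=1) ** (-beta / 2.0)
    return np.sum(w * Zt) / np.sum(w)

Z_simple = np.array([predict(X_train, Z_train, x) for x in X_ctrl])

# Ordinary least squares, for comparison with the linear regression results
coef, *_ = np.linalg.lstsq(np.c_[X_train, np.ones(n)], Z_train, rcond=None)
Z_linear = np.c_[X_ctrl, np.ones(n)] @ coef

for name, Z in [("simple", Z_simple), ("linear", Z_linear)]:
    slope, intercept = np.polyfit(Z_true, Z, 1)   # red line of Figure 1
    r2 = np.corrcoef(Z_true, Z)[0, 1] ** 2
    print(f"{name}: slope={slope:.4f}, intercept={intercept:.4f}, R-squared={r2:.4f}")
```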
2.1. Regression on the circle
In this second example, both the training set and the control points are located on the unit circle (on the border of the circle, not inside or outside, so technically this is a one-dimensional case). As expected, the R-squared for the linear regression is terrible and close to zero, while it is close to one for the simple regression. Note the weird distribution for the linear regression: this is not a glitch, it is expected to look that way.
Figure 2: Good fit with simple regression (points distributed on a circle)
Figure 3: Bad fit with linear regression (points distributed on the same circle as in Figure 2)
2.2. Extrapolation
In the third example, we used the same training set, with random locations on the unit circle. The control set consists this time of n = 100 points located in a square away from the circle, not intersecting the circle. This corresponds to extrapolation. Both the linear and the simple regression perform badly this time. The R-squared associated with the linear regression is close to zero, so no amount of re-scaling can fix it; the predicted values appear random.
However, even though the simple regression results are almost as far off as those coming from the linear regression in terms of bias, they can be substantially and easily improved. The picture below illustrates this fact.
Figure 4: Testing predictions outside the domain (extrapolation)
The slope in Figure 4 is 0.3784. For a perfect fit, it should be equal to one. However, the R-squared for the simple regression is pretty good: 0.842. So if we multiply the predicted values by a constant so that the average predicted value in the square outside the circle is no longer heavily biased, we get a good fit with the same R-squared. Of course, this assumes that the true average value on the unit square domain is known, at least approximately. It is significantly different from the average value computed on the training set (the circle), hence the bias. This fix won't work for the linear regression: its R-squared stays unchanged and close to zero after rescaling, even if we remove the bias.
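A minimal sketch of this bias-removal step, assuming the true average response on the new domain is known at least approximately; the function and argument names (including target_mean) are mine.

```python
import numpy as np

def remove_bias(Z_pred, target_mean):
    """Rescale the predicted values so that their average matches the
    (approximately) known true average on the new domain. Multiplying by a
    positive constant leaves the R-squared unchanged, which is why this fix
    helps the simple regression (R-squared 0.842) but not the linear one."""
    Z_pred = np.asarray(Z_pred, dtype=float)
    return Z_pred * (target_mean / Z_pred.mean())
```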
2.3. Confidence intervals for predicted values
Here, we are back to the first data set, which worked well for both linear and simple regression, doing interpolation rather than extrapolation as at the beginning of section 2. The control set is fixed, but we split the training set (consisting this time of 500 points) into 5 subsets. This approach is similar to cross-validation or bootstrapping, and allows us to compute confidence intervals for the predicted values. It works as follows (a code sketch is given after the list):
- Repeat the whole procedure 5 times, each time using a different subset of the training set
- Estimate Z based on the location (X, Y) for each point in the control set, using the formula in section 1: we will have 5 different estimates for each point, one for each subset of the training set
- For each point in the control set, compute the minimum and maximum estimated value, out of the 5 predictions
- The confidence interval for each point has the minimum predicted value as lower bound, and the maximum as upper bound.
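Below is a minimal sketch of this procedure in Python, assuming the inverse-distance predictor from section 1; the splitting scheme, seed, and function names are my choices, not the article's Excel implementation.

```python
import numpy as np

def predict(Xt, Zt, x, beta=5.0):
    # Inverse-distance predictor from section 1 (x assumed distinct from training points)
    w = np.sum((Xt - x) ** 2, axis=1) ** (-beta / 2.0)
    return np.sum(w * Zt) / np.sum(w)

def confidence_intervals(X_train, Z_train, X_ctrl, n_subsets=5, beta=5.0, seed=0):
    """Split the training set into n_subsets pieces, predict each control point
    once per subset, and return (lower, upper) = (min, max) of the predictions."""
    X_train = np.asarray(X_train, dtype=float)
    Z_train = np.asarray(Z_train, dtype=float)
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(len(X_train)), n_subsets)
    preds = np.array([[predict(X_train[s], Z_train[s], x, beta) for x in X_ctrl]
                      for s in subsets])          # shape: (n_subsets, n_control)
    return preds.min(axis=0), preds.max(axis=0)

# Coverage check: fraction of control points whose true value falls inside the CI
# lower, upper = confidence_intervals(X_train, Z_train, X_ctrl)
# coverage = np.mean((lower <= Z_true) & (Z_true <= upper))
```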
Of course the technique can be further refined, using percentiles rather than minimum and maximum for the bounds of the confidence intervals. The most modern way to do it is described in my book Statistics: New Foundations, Toolkit and Machine Learning Recipes, available here to DSC members. See chapters 15-16, pages 107-132.
The striking conclusions based on this test are as follows:
- The CI (confidence interval) based on simple regression is about 50% larger on average than the one based on linear regression
- The CI based on simple regression contains the true value 92% of the time, versus 24% of the time for the linear regression.
What is striking is the 92% achieved by the simple regression. Part of it is because the simple regression CI's are larger, but there is more to it.
2.4. Excel spreadsheet
All the data and tests discussed, including the computations, are available in my spreadsheet, allowing you to replicate the results or use it on your own data. You can download it here (krigi2.xlsx). The main tabs in the spreadsheet are:
- Square
- Circle-Interpolation
- Circle-Extrapolation
- Square-CI-Summary
The remaining tabs are used for auxiliary computations and can be ignored.
3. Generalization
If you look at the main formula in section 1, the predicted Z is the quotient of two arithmetic means: the one in the numerator is a weighted mean, and the one in the denominator is a standard mean. But the formula also works with other types of means, for example the exponential mean discussed in one of my previous articles, here. The advantage of using such means over the arithmetic mean is that there are hyperparameters attached to them, allowing for more granular fine-tuning.
For example, the exponential mean of n numbers A1, ..., An, defined in the article linked above, depends on a hyperparameter p.
When the hyperparameter p tends to 1, it corresponds to the arithmetic mean. Here, use the exponential mean of the n products w(X, Y; Xk, Yk) Zk and the exponential mean of the n weights w(X, Y; Xk, Yk), respectively for the numerator and denominator in the first formula in section 1. You can even use a different p for the numerator and denominator.
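The formula for the exponential mean is given in the article linked above and is not reproduced here. As an illustration only, the sketch below uses a base-p form, log_p of the average of p raised to the A_k, which is consistent with the limit just stated (it tends to the arithmetic mean as p tends to 1); treat this definition as an assumption rather than the article's exact formula.

```python
import numpy as np

def exponential_mean(A, p):
    """Base-p exponential mean: log_p of the average of p^A_k (assumed
    definition, see the note above). It tends to the arithmetic mean of A
    as p -> 1, and is pulled toward max(A) for large p."""
    A = np.asarray(A, dtype=float)
    return np.log(np.mean(p ** A)) / np.log(p)

A = [0.2, 0.5, 0.9]
print(exponential_mean(A, 1.0001))   # ~0.533, close to the arithmetic mean
print(exponential_mean(A, 100.0))    # ~0.70, pulled toward the maximum 0.9
```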
Other original exact interpolation techniques based on Fourier methods, in one dimension and for equally spaced points, are described in this article. Indeed, it was this type of interpolation that led me to investigate the material presented here. Robust, simple linear regression techniques are also described in chapter 1 of my book Statistics: New Foundations, Toolkit and Machine Learning Recipes, available here to DSC members.