The new variance introduced in this article fixes two *big data* problems associated with the traditional variance and the way it is computed in Hadoop, which relies on a numerically unstable formula.

**Synthetic Metrics**

This new metric is *synthetic*: it was not derived naturally from mathematics like the variance taught in any statistics 101 course, or the variance currently implemented in Hadoop. By *synthetic*, I mean that it was built to address issues with big data (outliers) and the way many big data computations are now done: the Map Reduce framework, of which Hadoop is an implementation. It is a top-down approach to metric design - from data to theory - rather than the traditional bottom-up approach - from theory to data.

Other synthetic metrics designed in our research laboratory include:

- Predictive power metric, related to entropy (that is, information quantification), used in big data frameworks, for instance to identify optimum feature combinations for scoring algorithms.
- Correlation for big data, defined by an algorithm and closely related to the optimum variance metric discussed here.
- Structuredness coefficient
- Bumpiness coefficient

**Hadoop, numerical and statistical stability**

There are two issues with the formula used for computing the variance in Hadoop. First, the formula, namely Var(x1, ... , xn) = {SUM(xi^2)/n} - {SUM(xi)/n}^2, is notoriously unstable. For large n, while the two terms mostly cancel each other out, each one taken separately can take a huge value, because of the squares aggregated over billions of observations. The result is numerical inaccuracies; some people have even reported negative variances. Read the comments attached to my article The curse of Big Data for details. Besides, there are variance formulas that do not require two passes over the entire data set, and that are numerically stable.

Of course, the formula currently used is easy to implement in a Map Reduce environment such as Hadoop. You first split your data into, say, 200 buckets. You compute both sums separately on these 200 buckets (each computation done simultaneously on a separate server to decrease computation time by a factor of 200), then perform 200 aggregations and a final square to get the value of the right-hand term.
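To see the instability concretely, here is a minimal sketch (plain Python, not the Hadoop source) contrasting the one-pass formula criticized here with a stable two-pass version:

```python
# A minimal sketch (plain Python, not the Hadoop source) contrasting the
# unstable one-pass formula with a numerically stable two-pass version.

def var_unstable(xs):
    """Var = SUM(xi^2)/n - (SUM(xi)/n)^2 -- two huge terms that mostly cancel."""
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

def var_stable(xs):
    """Two passes: compute the mean first, then average squared deviations."""
    n = len(xs)
    u = sum(xs) / n
    return sum((x - u) ** 2 for x in xs) / n

# With a large offset, each term of the unstable formula is about 1e16 and
# most significant digits are lost in the final subtraction.
data = [1e8 + x for x in (1.0, 2.0, 3.0)]
print(var_stable(data))    # 2/3, correct
print(var_unstable(data))  # inaccurate; can even come out negative
```

On small, well-centered data both functions agree; the divergence appears precisely when values are large relative to their spread, which is common in big data sets.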

My main concern, though, is not numerical instability, but the fact that in large data sets, outliers are a big problem and will screw up your variance. This variance formula makes variance comparisons across data bins meaningless, and results in highly volatile, inaccurate predictions if the data is used for predictive modeling. The solution is to use a different type of variance, one that is not sensitive to outliers, yet easy to compute with Hadoop.

**The abstract concept of variance**

Before introducing a new definition of variance, let's first discuss which properties a variance metric should satisfy. The following are desirable properties, listed in order of importance: #1, #2 and #3 are the most fundamental, while #5, #6 and #7 are rarely needed.

1. The variance is positive. It is equal to 0 if and only if x1 = ... = xn.
2. The variance (denoted as V) is symmetric: any permutation of (x1, ... , xn) has the same variance as (x1, ... , xn).
3. The further away you are from x1 = ... = xn, the higher the variance.
4. The variance is translation-invariant: V(x1+c, ... , xn+c) = V(x1, ... , xn) for any constant c.
5. The variance is scale-invariant: V(c*x1, ... , c*xn) = V(x1, ... , xn) for any constant c (see the first technical note below for explanations).
6. The variance is invariant under a number of matrix transformations such as rotations or reflections.
7. The variance is bounded; for instance, V(x1, ... , xn) < 1.

Properties #1 and #2 are mandatory. Property #5 is not necessary; it is easy to derive a scale-invariant variance W from any variance V: define W(x1, ... , xn) as V(x1, ... , xn) divided by the average V(xi, xj) computed on all subsets {xi, xj} of {x1, ... , xn}. Property #7 is not necessary. You can turn any non-bounded variance V into a bounded one using W = V / (1+V). The disadvantage of a non-bounded variance is that a single outlier can make the variance arbitrarily large.
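The two transforms just described can be sketched in a few lines of Python (my own illustration, using the traditional variance as the base V; the function names are hypothetical):

```python
# Sketches of the two transforms described above, applied to the traditional
# variance V: the bounded version W = V/(1+V) (property #7), and the
# scale-invariant version obtained by dividing V by its average over all
# pairs {xi, xj} (property #5). The pairwise average makes this O(n^2).
from itertools import combinations

def V(xs):
    n = len(xs)
    u = sum(xs) / n
    return sum((x - u) ** 2 for x in xs) / n

def bounded(xs):
    v = V(xs)
    return v / (1 + v)  # always in [0, 1)

def scale_invariant(xs):
    pairs = list(combinations(xs, 2))
    return V(xs) / (sum(V([a, b]) for a, b in pairs) / len(pairs))

xs = [1.0, 2.0, 5.0]
print(bounded(xs))
print(scale_invariant(xs), scale_invariant([10 * x for x in xs]))  # equal values
```

Multiplying every data point by 10 multiplies both V and the average pairwise V by 100, so the ratio is unchanged: that is exactly what scale invariance means.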

Property #4 is not necessary. You can derive a translation-invariant variance W from V as follows: W(x1, ..., xn) = max V(x1+c, ..., xn+c), where the maximum is computed over all **c**'s. This works only when the **c** achieving maximum V is finite, regardless of (x1, ..., xn).

I'd like to redefine property #3 differently. All the variances described here share a property similar to #3 anyway. But strictly speaking, property #3 means that the higher the distance **d** between (x1, ... , xn) and the line containing the points (0, ... , 0) and (1, ... , 1) in the n-dimensional space, the higher the variance. It turns out that **d** is proportional to the traditional standard deviation computed on (x1, ..., xn) (it is equal to sqrt(n) times it), and thus, the only variances satisfying property #3 would be monotonic functions of the traditional variance.

The new variance introduced here meets properties #1 and #2. It is defined as follows:

Let g be an arbitrary strictly convex function, for instance g(x) = |x|^a (with a>1). Then

V(x1, ... , xn) = {SUM(g(xi))/n} - g(SUM(xi)/n).

If you use g(x) = |x|^a with a=2, then the formula coincides with the traditional variance. I recommend a much lower value, for instance a=1.25, to solve both the numerical instability and outlier issues. I wish we could use a=1, but then g would be convex without being strictly convex, and the formula breaks down: for instance, V vanishes whenever all the xi's have the same sign. Interestingly, when transforming V (defined using g(x) = |x|^a) to make it translation-invariant, we get a new variance W which works perfectly well even with a=1, making computations faster: it takes far more time to compute |x|^1.25 than |x|^1.
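A minimal sketch of this formula (plain Python, my own illustration, not production code):

```python
# The proposed variance V(x1,...,xn) = SUM(g(xi))/n - g(SUM(xi)/n) with
# g(x) = |x|^a. With a=2 it coincides with the traditional variance;
# Jensen's inequality guarantees it is non-negative for any a > 1.

def new_var(xs, a=1.25):
    g = lambda x: abs(x) ** a
    n = len(xs)
    return sum(g(x) for x in xs) / n - g(sum(xs) / n)

xs = [3.0, 5.0, 10.0]
print(new_var(xs, a=2))     # about 8.67, the traditional variance
print(new_var(xs, a=1.25))  # about 0.35, far less sensitive to large values
```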

When g(x) = |x|^a and 1 < a < 2, the **c** that maximizes V(x1+c, ..., xn+c) is **c** = -(x1 + ... + xn)/n. I haven't proved this result yet; I invite you to come up with a formal proof. But I tested it on several data sets. Clearly, the result no longer holds when a=1 or a=2; if (and only if) a=2, the variance V and its translation-invariant version W are identical. So our final variance, after this transformation, is

W(x1, ... , xn) = {SUM g(xi - u)}/n, with u = (x1 + ... + xn)/n, g(x) = |x|^a and a=1.
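With a=1, this W is simply the mean absolute deviation from the mean. A minimal sketch (my own illustration), with a quick look at outlier sensitivity:

```python
# W with a=1: the mean absolute deviation from the mean, compared against
# the traditional variance on data contaminated by one large outlier.

def W(xs):
    u = sum(xs) / len(xs)
    return sum(abs(x - u) for x in xs) / len(xs)

def var_traditional(xs):
    u = sum(xs) / len(xs)
    return sum((x - u) ** 2 for x in xs) / len(xs)

clean = [1.0, 2.0, 3.0, 4.0, 5.0]
dirty = clean + [1000.0]
print(W(clean), W(dirty))                              # the outlier inflates W roughly linearly
print(var_traditional(clean), var_traditional(dirty))  # but inflates V quadratically
```

The single outlier blows up the traditional variance by several orders of magnitude more than it blows up W, which is the robustness property this article is after.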

**Technical note**: W satisfies property #4 (translation invariance) under the L^a norm. If you transform the above W into a scale-invariant variance using the technique previously described, you will end up with a final metric that is good at detecting clusters and data structures with holes: the resulting metric takes higher values when strong cluster patterns are present, and conversely, low values when data points are more evenly spread. This works for 1 < a < 2, as well as for a=1. It does not work for a=2 (the standard variance), as the resulting metric would then depend only on n: it would always be equal to (n-1)/(2n) regardless of (x1, ... , xn). So the traditional variance is the only variance (with parameter **a** between 1 and 2) that cannot be made scale-invariant using the approach described here. As an exercise, compute the variance for a=1, after transforming W (the translation-invariant version of V) to make it scale-invariant, on four data sets with n=10 points: the first data set is {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, the second is {0, 0, 10/3, 4.5, 4.5, 4.5, 4.5, 9, 9, 9}, the third is {0, 0, 0, 0, 0, 9, 9, 9, 9, 9} and the fourth is {0, 0, 0, 0, 0, 1, 1, 1, 1, 1}. This will help you understand what a scale-invariant variance is. Note that the values computed on the third and fourth data sets are identical precisely because of the scale-invariance property. Finally, note that our formula to obtain a scale-invariant variance is O(n^2) in terms of computation, that is, very time consuming. Everything else discussed here is O(n). How can you improve on this O(n^2)?
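You can check your answers to this exercise with a short script (the helper names below are my own):

```python
# W is the translation-invariant variance with a=1 (mean absolute deviation),
# made scale-invariant by dividing by the average W over all pairs {xi, xj}:
# the O(n^2) construction discussed in the technical note above.
from itertools import combinations

def W(xs):
    u = sum(xs) / len(xs)
    return sum(abs(x - u) for x in xs) / len(xs)

def scale_invariant_W(xs):
    pairs = list(combinations(xs, 2))
    return W(xs) / (sum(W([a, b]) for a, b in pairs) / len(pairs))

third  = [0, 0, 0, 0, 0, 9, 9, 9, 9, 9]
fourth = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(scale_invariant_W(third))   # identical to the value for the fourth set,
print(scale_invariant_W(fourth))  # since the metric is scale-invariant
```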

**A New Big Data Theorem**

Our research lab occasionally produces new theorems with immediate application to better process data - big and small.

Theorem: The formula V(x1, ... , xn) = {SUM(g(xi))/n} - g(SUM(xi)/n) defines a variance, satisfying properties #1 and #2, when g is a strictly convex function.

This result - I call it the *Fourth AnalyticBridge Theorem* - is a direct consequence of Jensen's inequality. For those interested, here are our three previous theorems:

- First AnalyticBridge theorem: used to build model-free confidence intervals.
- Second AnalyticBridge theorem: used in the context of random permutations (*a random permutation of non-independent numbers constitutes a sequence of independent numbers*).
- Third AnalyticBridge theorem: used to prove that a newly defined correlation (used in the context of big data) takes values between -1 and 1. Proved by Jean-Francois Puget in 2013, in our first data science competition.

Exercise: Assuming X=(x1, ..., xn) and Y=(y1, ...,yn) are two vectors, then Cov(X, Y) = {V(X+Y) - [V(X)+V(Y)]}/2 defines the traditional covariance, if g(x)=|x|^a and a=2. How to adapt this formula to 1 < a < 2? Can you generalize this covariance definition to more than two vectors?
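As a starting point for this exercise, here is a numerical check (not a proof) of the claim for a=2: the covariance obtained from V via this formula matches the traditional (population) covariance.

```python
# Checking that Cov(X, Y) = {V(X+Y) - [V(X) + V(Y)]}/2 recovers the
# traditional population covariance when g(x) = |x|^a and a=2.

def V(xs, a=2):
    g = lambda x: abs(x) ** a
    n = len(xs)
    return sum(g(x) for x in xs) / n - g(sum(xs) / n)

def cov_from_V(X, Y, a=2):
    S = [x + y for x, y in zip(X, Y)]
    return (V(S, a) - (V(X, a) + V(Y, a))) / 2

def cov_classic(X, Y):
    n = len(X)
    ux, uy = sum(X) / n, sum(Y) / n
    return sum((x - ux) * (y - uy) for x, y in zip(X, Y)) / n

X, Y = [1.0, 2.0, 4.0], [3.0, 1.0, 5.0]
print(cov_from_V(X, Y), cov_classic(X, Y))  # both equal 4/3 here
```

This works for a=2 because Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X, Y); adapting it to 1 < a < 2 is the open part of the exercise.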

**Transformation-Invariant Metrics**

It is easy to prove that when a=2 (the traditional variance), V satisfies property #4, that is, it is translation invariant: adding a constant to all data points x1, ... , xn does not change V. This is no longer true when 1 < a < 2. For instance, with n=3, we have:

- when a=2 and x1=3, x2=5, x3=10, then V=8.67
- when a=2 and x1=10,003, x2=10,005, x3=10,010, then V=8.67 (unchanged)
- when a=1.25 and x1=3, x2=5, x3=10, then V=0.35
- when a=1.25 and x1=10,003, x2=10,005, x3=10,010, then V=0.00

This is one of the reasons why I don't like translation invariance and prefer a=1.25 over the standard a=2.
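The four values above can be checked with a few lines of Python (rounding in the last digit may differ slightly from the figures quoted in the text):

```python
# Reproducing the translation-(non-)invariance example: V with a=2 is
# unchanged by the shift of 10,000, while V with a=1.25 collapses to ~0.

def V(xs, a):
    g = lambda x: abs(x) ** a
    n = len(xs)
    return sum(g(x) for x in xs) / n - g(sum(xs) / n)

for a in (2, 1.25):
    for shift in (0, 10000):
        xs = [3 + shift, 5 + shift, 10 + shift]
        print("a =", a, "shift =", shift, "V =", round(V(xs, a), 2))
```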

It can easily be proved that to satisfy the translation-invariance property, the function g must satisfy g(x) - 2g(x+1) + g(x+2) = constant. The only functions g satisfying this constraint are quadratic forms (for instance when a=2), and are thus linked to the traditional variance. Likewise, to satisfy the scale-invariance property, the function g would have to satisfy the equation g(2x) - 2g(x) = constant. There is no strictly convex function satisfying this constraint, so none of our V's is scale-invariant. But you can use a transformation described above to build a variance W which is scale-invariant.

**Technical note**: The traditional variance defined by a=2 is also invariant under orthogonal transformations (reflections or rotations) that preserve the mean, that is, it is invariant under linear operators that are both orthogonal and stochastic. In matrix notation, such an operator is an n x n matrix A, applied to the column vector X = (x1, ..., xn)', having the sum of each column equal to 1. Here ' denotes the transposed vector (from row to column). In short, V(AX) = V(X) if A is such a matrix and a=2. A particularly interesting case is when A is both doubly stochastic and orthogonal: then A is a permutation matrix. Such matrices A can be produced using the following iterative algorithm: start with any doubly stochastic matrix A_0, and define A_{k+1} = (A_k - s*I) / (I - s*A'_k) for k > 0. Here we assume 0 < s < 1 for convergence, and I denotes the n x n identity matrix.
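Here is a sanity check of the invariance claim, using an example of my own (not from the article): the Householder reflection H = I - 2vv' with v = (1, 1, -2)/sqrt(6) is orthogonal, and since v is orthogonal to (1, 1, 1), every column of H sums to 1 ("stochastic" in the loose sense used above: entries may be negative).

```python
# A Householder reflection whose columns each sum to 1, applied to a data
# vector: the mean and the traditional (a=2) variance are both preserved.

def V(xs):
    n = len(xs)
    u = sum(xs) / n
    return sum((x - u) ** 2 for x in xs) / n  # a=2, traditional variance

# H = I - 2vv' with v = (1, 1, -2)/sqrt(6), written out explicitly:
A = [[ 2/3, -1/3,  2/3],
     [-1/3,  2/3,  2/3],
     [ 2/3,  2/3, -1/3]]

X = [3.0, 5.0, 10.0]
AX = [sum(A[i][j] * X[j] for j in range(3)) for i in range(3)]
print(AX)           # approximately [7, 9, 2]; the mean (6) is preserved
print(V(X), V(AX))  # both approximately 8.67
```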

**Implementation: Communications versus Computational Costs**

One of the issues with Hadoop and Map Reduce is the time spent in data transfers, as you split a task into multiple parallel processes, each running on a different server. It shifts the problem of computational complexity from algorithmic complexity to communication or bandwidth costs. By communication, I mean the time spent moving data around - from a centralized server (database or file management system) or distributed system, to memory or to hard disks. How do you optimize these data transfers? How do you optimize in-memory computations? You cannot rely on a built-in function such as the variance if it is not properly implemented, as shown here.

Indeed, the variance computation is an interesting problem that illustrates some of Hadoop's technical challenges. If no efficient built-in algorithm is available, it is necessary to do home-made computations that leverage the Map Reduce architecture, while at the same time moving as little data as possible over intranet networks (in particular, not downloading all the data once to compute the mean, then once again for the variance). Given the highly unstable variance formula used in Hadoop (a=2), one solution is to stabilize it by subtracting a large value from all the xi's before computing it; after all, with a=2, the variance is translation-invariant, so why not exploit this property? Even better, use a=1.25, which is much more robust against outliers. Or better yet, use its translation-invariant version W, with a=1. But then you have to write the code yourself, or use some libraries.
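One standard technique for this (a sketch in plain Python, not actual Hadoop code, and not the article's own method) is to have each mapper summarize its bucket as (count, mean, M2) with a stable streaming pass, and have the reducers merge summaries pairwise. This gives a single-pass, numerically stable, Map-Reduce-friendly a=2 variance with only three numbers transferred per bucket.

```python
# Map step: each bucket is reduced to a (count, mean, M2) summary using a
# stable streaming pass (Welford's method). Reduce step: summaries are
# merged pairwise, so no second pass over the data is ever needed.
from functools import reduce

def summarize(bucket):
    """One stable streaming pass over one bucket."""
    n, mean, M2 = 0, 0.0, 0.0
    for x in bucket:
        n += 1
        d = x - mean
        mean += d / n
        M2 += d * (x - mean)
    return n, mean, M2

def merge(s1, s2):
    """Combine two (count, mean, M2) summaries into one."""
    n1, u1, m1 = s1
    n2, u2, m2 = s2
    n = n1 + n2
    d = u2 - u1
    return n, u1 + d * n2 / n, m1 + m2 + d * d * n1 * n2 / n

buckets = [[1e8 + 1.0], [1e8 + 2.0], [1e8 + 3.0]]
n, mean, M2 = reduce(merge, (summarize(b) for b in buckets))
print(M2 / n)  # population variance: 2/3, with no catastrophic cancellation
```

Only the tiny summaries travel over the network, which addresses the communication cost discussed above; the same map/merge pattern applies to W with a=1 once the global mean has been merged.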

**Bayesian Models**

Much of this discussion focused on a specific parameter a, with potential values between 1 and 2. This parameter could be assigned a prior distribution in a Bayesian framework.
