Saturday, February 5, 2022

Fascinating Facts About Complex Random Variables and the Riemann Hypothesis

Despite my long career in statistics and machine learning, both in academia and in industry, I had never heard of complex random variables until recently, when I stumbled upon them by chance while working on a number theory problem. I have since learned that they are used in several applications, including signal processing, quadrature amplitude modulation, information theory, and actuarial sciences. See here and here.

In this article, I provide a short overview of the topic, with an application to understanding why the Riemann hypothesis (arguably the most famous unsolved mathematical conjecture of all time) might be true, using probabilistic arguments. State-of-the-art, recent developments about this conjecture are discussed in a way that most machine learning professionals can understand. The style of my presentation is very compact, with numerous references provided as needed. It is my hope that this will broaden the horizon of the reader, add new modeling tools to her arsenal, and offer some off-the-beaten-path reading. The level of mathematics is rather simple, and you need to know very little (if anything) about complex numbers. After all, these random variables can be understood as bivariate vectors (X, Y), with X representing the real part and Y the imaginary part. They are typically denoted as Z = X + iY, where the complex number i (whose square is equal to -1) is the imaginary unit. There are some subtle differences with bivariate real variables, and the interested reader can find more details here. The complex Gaussian variable (see here) is of course the most popular case.
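If you want to experiment with this bivariate representation, here is a minimal Python sketch (my own illustration, not code from any of the referenced papers; the sample size, variable names and the choice of a circularly-symmetric complex Gaussian are mine). It simulates a complex Gaussian variable and checks that its mean is close to 0 and that its variance E|Z|^2 is close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A circularly-symmetric complex Gaussian: real and imaginary parts are
# independent N(0, 1/2), so that the variance E|Z|^2 equals 1
X = rng.normal(0.0, np.sqrt(0.5), n)     # real part
Y = rng.normal(0.0, np.sqrt(0.5), n)     # imaginary part
Z = X + 1j * Y

print("mean     :", np.mean(Z))                 # close to 0
print("variance :", np.mean(np.abs(Z) ** 2))    # close to 1
```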

1. Illustration with damped complex random walks

Let (Zk) be an infinite sequence of independently and identically distributed random variables, with P(Zk = 1) = P(Zk = -1) = 1/2. We define the damped sequence as

Z(s) = Z1/1^s + Z2/2^s + Z3/3^s + ...
The originality here is that s = σ + it is a complex number. The above sequence clearly converges if the real part of s (the real number σ) is strictly above 1. The computation of the variance (first for the real part of Z(s), then for the imaginary part, then the full variance) yields:

Var[Re Z(s)] = Σ cos²(t log k) / k^(2σ),   Var[Im Z(s)] = Σ sin²(t log k) / k^(2σ),   Var[Z(s)] = Σ 1 / k^(2σ) = ζ(2σ),   where the sums are over k = 1, 2, 3, ...
Here ζ is the Riemann zeta function; see also here. So we are dealing with a Riemann zeta-type of distribution; other examples of such distributions are found in one of my previous articles, here. The core result is that the damped sequence not only converges if σ > 1 as announced earlier, but as soon as σ > 1/2 if you look at the variance: σ > 1/2 keeps the variance of the infinite sum Z(s) finite. This result, due to the fact that we are manipulating complex rather than real numbers, will be of crucial importance in the next section, focusing on an application.
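This result is easy to check by simulation. The Python sketch below is my own illustration, not code from the article; the truncation level N, the number of simulated copies, and the choice s = 0.75 + 3i are arbitrary. It simulates truncated copies of Z(s) and compares the empirical variance with ζ(2σ).

```python
import numpy as np
from scipy.special import zeta

rng = np.random.default_rng(1)
sigma, t = 0.75, 3.0                    # s = sigma + i*t, with sigma > 1/2
s = sigma + 1j * t
N = 5_000                               # truncation level of the infinite series
n_sim = 2_000                           # number of simulated copies of Z(s)

k = np.arange(1, N + 1)
weights = k ** (-s)                     # damping factors 1/k^s

signs = rng.choice([-1.0, 1.0], size=(n_sim, N))   # the Z_k's
Z = (signs * weights).sum(axis=1)       # truncated Z(s), one complex value per simulation

print("empirical variance E|Z(s)|^2   :", np.mean(np.abs(Z) ** 2))
print("truncated sum of 1/k^(2 sigma) :", np.sum(k ** (-2.0 * sigma)))
print("limit zeta(2 sigma)            :", zeta(2 * sigma))
```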

It is possible to plot the distribution of Z(s), depending on the complex parameter s (or equivalently, on the two real parameters σ and t), using simulations. You can also compute its distribution numerically, using the inverse Fourier transform of its characteristic function. The characteristic function, computed for a real argument τ, is given by the following surprising product:

1.1. Smoothed random walks and distribution of runs

This sub-section is useful for the application discussed in section 2, and also for its own sake. If you don't have much time, you can skip it, and come back to it later.

The sum of the first n terms of the series defining Z(s) represents a random walk (with n playing the role of time) when s = 0: it has zero mean and variance equal to n (thus growing indefinitely with n); it can take on positive or negative values, and can stay positive (or negative) for a very long time, though it will eventually oscillate infinitely many times between positive and negative values (see here). The case s = 0 corresponds to the classic random walk. We define the smoothed version Z*(s) as follows:

Z*(s) = Σ Z*k / k^s,   with Z*k = (Zk + (-1)^(k+1)) / 2,   the sum being over k = 1, 2, 3, ...
A run of length m is defined as a maximal subsequence Zk+1, ..., Zk+m all having the same sign: that is, m consecutive values all equal to +1, or all equal to -1. The probability for a run to be of length m > 0, in the original sequence (Zk), is equal to 1 / 2^m (2 raised to the power m). In the smoothed sequence (Z*k), after removing the zeroes, that probability becomes 2 / 3^m. While by construction the Zk's are independent, note that the Z*k's are no longer independent. After removing all the zeroes (representing 50% of the Z*k's), the runs in the sequence (Z*k) tend to be much shorter than those in (Zk). This implies that the associated random walk (now actually less random) based on the Z*k's is better controlled, and can't go up and up (or down and down) for nearly as long, unlike the original random walk based on the Zk's. A classic result, known as the law of the iterated logarithm, states that

lim sup (Z1 + Z2 + ... + Zn) / √(2n log log n) = 1,  as n → ∞,
almost surely (that is, with probability 1). The definition of "lim sup" can be found here. Of course, this is no longer true for the sequence (Z*k) even after removing the zeroes.
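The run-length probabilities 1/2^m and 2/3^m quoted above are easy to check empirically. The Python sketch below is my own check, not part of the original article; it assumes the smoothing Z*k = (Zk + (-1)^(k+1))/2 written above, and tabulates run lengths in a long simulated sequence, before and after removing the zeroes.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)
n = 1_000_000
Z = rng.choice([-1, 1], size=n)                    # the original sequence (Zk)

def run_lengths(x):
    """Lengths of the maximal blocks of identical consecutive values."""
    change = np.flatnonzero(np.diff(x) != 0)
    starts = np.concatenate(([0], change + 1))
    ends = np.concatenate((change + 1, [len(x)]))
    return ends - starts

# Smoothed sequence Z*_k = (Z_k + (-1)^(k+1)) / 2, k = 1, 2, 3, ...
k = np.arange(1, n + 1)
Zstar = (Z + (-1) ** (k + 1)) / 2
Zstar_nonzero = Zstar[Zstar != 0]                  # about half of the terms are zero

for name, seq, expected in [("(Zk)", Z, lambda m: 0.5 ** m),
                            ("(Z*k), zeros removed", Zstar_nonzero, lambda m: 2 / 3 ** m)]:
    freq = Counter(run_lengths(seq))
    total = sum(freq.values())
    print(name)
    for m in range(1, 6):
        print("  run length %d: observed %.4f, expected %.4f" % (m, freq[m] / total, expected(m)))
```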

2. Application: heuristic proof of the Riemann hypothesis

The Riemann hypothesis, one of the most famous unsolved mathematical problems, is discussed here, and in the DSC article entitled Will big data solve the Riemann hypothesis. We approach this problem using a function L(s) that behaves (to some extent) like the Z(s) defined in section 1. We start with the following definitions:

λ(k) = (-1)^Ω(k),   Ln(s) = λ(1)/1^s + ... + λ(n)/n^s,   L(s) = Σ λ(k)/k^s = Π (1 + pj^(-s))^(-1),   the sum being over k ≥ 1 and the product over all the primes pj,
where

  • Ω(k) is the prime omega function, counting the number of prime factors of k, with multiplicity,
  • λ(k) is the Liouville function,
  • p1, p2, and so on (with p1 = 2) are the prime numbers.

Note that L(s, 1) = ζ(s) is the Riemann zeta function, and L(s) = ζ(2s) / ζ(s). Again, s = σ + it is a complex number. We also define Ln = Ln(0) and ρ = L(0, 1/2). We have L(1) = 0. The series for L(s) is guaranteed to converge if σ > 1.
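For readers who want to verify the identity L(s) = ζ(2s)/ζ(s) numerically when σ > 1, here is a small Python sketch of mine (it uses sympy's factorint to obtain Ω(k) and mpmath for the zeta values; the truncation level n and the choice s = 1.5 are arbitrary).

```python
from sympy import factorint
from mpmath import zeta

def liouville(k: int) -> int:
    """Liouville function lambda(k) = (-1)^Omega(k), Omega counting prime factors with multiplicity."""
    return -1 if sum(factorint(k).values()) % 2 else 1

s = 1.5                                  # a real s with sigma > 1, where convergence is guaranteed
n = 20_000
partial = sum(liouville(k) / k ** s for k in range(1, n + 1))

print("partial sum of lambda(k)/k^s, k <= %d : %.6f" % (n, partial))
print("zeta(2s)/zeta(s)                      :", zeta(2 * s) / zeta(s))
```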

2.1. How to prove the Riemann hypothesis?

Any of the following conjectures, if proven, would make the Riemann hypothesis true:

  • The series for L(s) also converges if σ > 1/2: this is what we investigate in section 2.2. If it were to converge only when σ is larger than (say) σ0 = 0.65, it would not establish the Riemann Hypothesis (RH) in the full critical strip 1/2 < σ < 1, but only the absence of zeros in σ0 < σ < 1. That would still be a major victory, allowing us to get much more precise estimates about the distribution of prime numbers than currently known today. RH is equivalent to the statement that ζ(s) has no zero if 1/2 < σ < 1.
  • The number ρ is a normal number in base 2 (this would prove the much stronger Chowla conjecture, see here)
  • The sequence (λ(k)) is ergodic (this would also prove the much stronger Chowla conjecture, see here)
  • The sequence x(n+1) = 2x(n) - INT(2x(n)), with x(0) = (1 + ρ) / 2, is ergodic. This is equivalent to the previous statement. Here INT stands for the integer part function, and the x(n)'s are iterates of the Bernoulli map, one of the simplest chaotic discrete dynamical systems (see Update 2 in this post), whose main invariant distribution is uniform on [0, 1].
  • The function 1 / L(s) = ζ(s) / ζ(2s) has no zero if 1/2  <  σ  <  1
  • The numbers λ(k) behave in a way that is sufficiently random so that, for any ε > 0, we have (see here):

Ln = O(n^(1/2 + ε)),   that is,   |λ(1) + λ(2) + ... + λ(n)| / n^(1/2 + ε) → 0 as n → ∞.
Note that the last statement is weaker than the law of the iterated logarithm mentioned in section 1.1. The coefficient λ(k) plays the same role as Zk in section 1; however, because λ(mn) = λ(m)λ(n), the λ(k)'s can't be independent, not even asymptotically independent, unlike the Zk's. Clearly, the sequence (λ(k)) has weak dependencies. That in itself does not prevent the law of the iterated logarithm from applying (see examples here), nor does it prevent ρ from being a normal number (see here why). But it is conjectured that the law of the iterated logarithm does not apply to the sequence (λ(k)), due to another conjecture by Gonek (see here).
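The growth condition on Ln in the last bullet above is easy to explore numerically, at least for small n. The Python sketch below is my own illustration, not from the article; the sieve, the cut-off n and the choice ε = 0.05 are mine, and of course such a finite computation proves nothing, it only illustrates the statement.

```python
import numpy as np

def liouville_sieve(n: int) -> np.ndarray:
    """Return lambda(1), ..., lambda(n), using a smallest-prime-factor sieve."""
    spf = np.zeros(n + 1, dtype=np.int64)          # smallest prime factor of each k
    for p in range(2, n + 1):
        if spf[p] == 0:                            # p is prime
            block = spf[p::p]
            block[block == 0] = p
    lam = np.ones(n + 1, dtype=np.int64)
    for k in range(2, n + 1):
        lam[k] = -lam[k // spf[k]]                 # removing one prime factor flips the sign
    return lam[1:]

n = 500_000
lam = liouville_sieve(n)
L = np.cumsum(lam)                                 # L_n = lambda(1) + ... + lambda(n)
eps = 0.05
ns = np.arange(1, n + 1)
ratio = np.abs(L) / ns ** (0.5 + eps)

print("max of |L_n| / n^(1/2 + eps) over 1000 <= n <= %d : %.4f" % (n, ratio[999:].max()))
print("L_n at n = %d : %d" % (n, L[-1]))           # negative, in line with the bias discussed in section 2.2
```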

2.2. Probabilistic arguments in favor of the Riemann hypothesis

The deterministic sequence (λ(k)), consisting of +1's and -1's in a 50/50 ratio, appears to behave rather randomly (if you look at its limiting empirical distribution), just like the sequence (Zk) in section 1 behaves perfectly randomly. Thus, one might think that the series defining L(s) would also converge for σ > 1/2, not just for σ > 1. The reason this could be true is that the same thing happens to Z(s) in section 1, for the same reason. And if it is true, then the Riemann hypothesis is true, because of the first statement in the bullet list in section 2.1. Remember, s = σ + it; in other words, σ is the real part of the complex number s.

However, there is a big caveat that could perhaps be addressed to make the arguments more convincing; this is the purpose of this section. As noted at the bottom of section 2.1, the sequence (λ(k)), even though it passes all the randomness tests that I have tried, is much less random than it appears to be. It obviously has weak dependencies, since the function λ is multiplicative: λ(mn) = λ(m)λ(n). This is related to the fact that prime numbers are not perfectly randomly distributed. Another disturbing fact is that Ln, the equivalent of the random walk defined in section 1, seems biased towards negative values. For instance, Ln is never positive for 1 < n < 906,150,257; the smallest counterexample, found in 1980, disproves Pólya's conjecture (see here). One way to address this is to work with Rademacher multiplicative random functions instead of (Zk) in section 1: see here for an example that would make the last item in the bullet list in section 2.1 true. Or see here for an example that preserves the law of the iterated logarithm (which itself would also imply the Riemann Hypothesis).
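For readers who want to play with this random analogue, here is a minimal Python sketch of one random multiplicative model. It is my own simplified, completely multiplicative variant, not necessarily the exact model used in the linked papers: independent random signs are attached to the primes and extended multiplicatively, and the partial sums, the analogue of Ln, can then be examined.

```python
import numpy as np
from sympy import primerange

rng = np.random.default_rng(3)
n = 200_000

# Attach an independent random sign to every prime, then extend multiplicatively:
# f(1) = 1 and f(k) = f(p) * f(k/p), where p is the smallest prime factor of k.
primes = list(primerange(2, n + 1))
sign_of_prime = {p: int(rng.choice([-1, 1])) for p in primes}

spf = np.zeros(n + 1, dtype=np.int64)              # smallest-prime-factor sieve
for p in primes:
    block = spf[p::p]
    block[block == 0] = p

f = np.ones(n + 1, dtype=np.int64)
for k in range(2, n + 1):
    f[k] = sign_of_prime[int(spf[k])] * f[k // spf[k]]

partial_sums = np.cumsum(f[1:])                    # the analogue of L_n for the random model
print("max |f(1) + ... + f(m)| for m <= %d : %d" % (n, np.abs(partial_sums).max()))
```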

Finally, working with a smoothed version of L(s) or Ln, using the smoothing technique described in section 1.1, may lead to results that are easier to obtain, and could bring new insights into the original series L(s). The smoothed version L*(s) is defined, using the same technique as in section 1.1, as

L*(s) = Σ (λ(k) + (-1)^(k+1)) / (2 k^s) = (η(s) + L(s)) / 2,   the sum being over k ≥ 1.
The function η(s) is the Dirichlet eta function, and L*(s) can be computed in Mathematica as (DirichletEta[s] + Zeta[2s] / Zeta[s]) / 2. Mathematica uses the analytic continuation of the ζ function if σ < 1. For instance, see the computation of L*(0.7) = -0.237771..., here. A table of the first million values of the Liouville function λ(k) can be produced in Mathematica in just a few seconds, using the command Table[LiouvilleLambda[n], {n, 1, 1000000}]. For convenience, I stored them in a text file, here. It would be interesting to see how good (or bad) they are as the basis for a pseudorandom number generator.
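If you prefer Python to Mathematica, the same computation can be done with mpmath; this is a sketch of mine, not code from the article (altzeta is mpmath's name for the Dirichlet eta function). It should reproduce, up to numerical precision, the value of L*(0.7) quoted above.

```python
from mpmath import mp, altzeta, zeta

mp.dps = 20                               # working precision, in decimal digits

def L_star(s):
    """L*(s) = (eta(s) + zeta(2s)/zeta(s)) / 2, using the analytic continuation of zeta."""
    return (altzeta(s) + zeta(2 * s) / zeta(s)) / 2

print(L_star(0.7))                        # should match the value -0.237771... quoted above
```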

New Machine Learning Optimization Technique - Part 1

In this series, we discuss a technique to find either the minima or the roots of a chaotic, unsmooth, or discrete function. A root-finding technique that works well for continuous, differentiable functions is successfully adapted and applied to piecewise constant functions with an infinite number of discontinuities. It even works if the function has no root: it will then find minima instead. For the method to work, some constraints must be put on the parameters used in the algorithm, while avoiding over-fitting at the same time; this would also be true anyway in the smooth, continuous, differentiable case. The method does not work in the classical sense where an iterative algorithm converges to a solution. Here the iterative algorithm always diverges, yet it has a clear stopping rule that tells us when we are very close to a solution, and what to do to find the exact solution. This is the originality of the method, and why I call it new. Our technique, like many machine learning techniques, can generate false positives or false negatives, and one aspect of our methodology is to minimize this problem.

Applications are discussed, as well as a full implementation with results, both for a real-life, challenging case and for simulated data. This first article in the series is a detailed introduction. Interestingly, we also show an example where a continuous, differentiable function, with a very large number of wild oscillations, benefits from being transformed into a non-continuous, non-differentiable function, which is then handled with our non-standard technique to find roots (or minima), where the standard technique fails. For the moment, we limit ourselves to the one-dimensional case.

1. Example of problem

In the following, a = 7919 × 3083 is a constant, and b is the variable. We try to find the two roots b = 7919 and b = 3083 (both prime numbers) of the function f(b) = 2 - cos(2πb) - cos(2πa/b). We will only look between b = 2000 and b = 4000, to find the root b = 3083. This function is plotted below, between b = 2900 and b = 3200. The X-axis represents b, the Y-axis represents f(b).
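The plot is easy to reproduce; here is a short Python snippet of mine (not the article's source code; the grid resolution is an arbitrary choice) that draws f(b) on the same range.

```python
import numpy as np
import matplotlib.pyplot as plt

a = 7919 * 3083
b = np.linspace(2900, 3200, 20_000)
f = 2 - np.cos(2 * np.pi * b) - np.cos(2 * np.pi * a / b)

plt.plot(b, f, linewidth=0.3)
plt.xlabel("b")
plt.ylabel("f(b)")
plt.title("f(b) = 2 - cos(2 pi b) - cos(2 pi a / b),  a = 7919 * 3083")
plt.show()
```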

Despite appearances, this function is well behaved, smooth, continuous and differentiable everywhere between b = 2000 and b = 4000: this would be apparent if you could zoom in a few times on the picture. Yet, it is no surprise that classical root-finding or minimum-finding algorithms such as Newton-Raphson (see here) fail, require a very large number of iterations to converge, or require starting very close to the unknown root, and are thus of no practical value. For instance, Mathematica can't find a global minimum even if you narrow down the search to 3000 < b < 3200, see here.

In this example, f(b) = 0 in the interval 2000 < b < 4000 (and this is also the minimum possible value) if and only if b divides a. In order to solve this problem, we transformed f into a new function g, which, despite being unsmooth, leads to a much faster algorithm. The new function g, as well as its smoothed version h, are pictured below (g is in blue, h is in red). In this case, our method solves the factoring problem (factoring the number a) in relatively few iterations, potentially much faster than sequentially identifying and trying, as potential divisors of a, all 138 prime numbers between 2000 and 3083.

However, this is by no means the best factoring algorithm, as it was not designed specifically for that purpose, but rather as a general-purpose method. In a subsequent article in this series, we apply the methodology to data that behaves somewhat like this example, but with random numbers: in that case, it is impossible to "guess" what the roots are, yet the algorithm is just as efficient.

2. Fundamentals of our new algorithm

This section outlines the main features of the algorithm. First, you want to magnify the effect of the root. In the first figure, roots (or minima) are invisible to the naked eye, or at least indistinguishable from many other values that are very close to zero. To achieve this goal (assuming f is positive everywhere), replace a suitably discretized version of f(x) by g(x) = log(ε + λf(x)), with ε > 0 close to zero. Then, in order to enlarge the width of the "hole" created around a root (in this case around b = 3083), you use some kind of moving average, possibly followed by a shift on the Y-axis.
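To make this concrete, here is a minimal Python sketch of that transformation (my own illustration; the values of ε, λ, the grid step and the window size are arbitrary choices, not the ones used to produce the article's figures). It discretizes the f of section 1, applies the log magnification, and smooths the result with a moving average.

```python
import numpy as np

a = 7919 * 3083
eps, lam, window = 1e-4, 1.0, 21      # illustrative parameter choices, not the article's

# Discretized version of f on a grid of candidate values of b
b = np.arange(2000.0, 4000.0, 0.1)
f = 2 - np.cos(2 * np.pi * b) - np.cos(2 * np.pi * a / b)

# Magnify the near-zero values: g has a deep negative spike (a "hole") where f is essentially 0
g = np.log(eps + lam * f)

# Enlarge the hole with a centered moving average (the smoothed version)
h = np.convolve(g, np.ones(window) / window, mode="same")

print("deepest point of g :", b[np.argmin(g)])     # lands at the root b = 3083
print("deepest point of h :", b[np.argmin(h)])     # the hole is wider but shallower
```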

The algorithm then proceeds as a fixed-point iteration: b(n+1) = b(n) + μ g(b(n)). Here we started with b(0) = 2000. Rescaling is optional, if you want to keep the iterates bounded; one rescaling that does the trick here is b(n+1) = b(n) + μ g(b(n)) / SQRT(b(n)). Assuming the iterations approach the root (or minimum) from the right direction, once it hits the "hole", the algorithm emits a signal, but then continues without ever ending, without ever converging. You stop when you see the signal, or after a fixed number of iterations if no signal ever shows up. In the latter case, you just missed the root (the equivalent of a false negative).

The signal is measured as the ratio (b(n) - b(n-1)) / (b(n+1) - b(n)), usually close to one, but which dramatically spikes just after entering the hole, depending on the parameters. In some cases, the signal may be weaker, absent, or appear multiple times, and this can result in false positives. In some cases, the second signal (when present) corresponded to the second root. Even if there is no root but a minimum instead, as in the above figure, the signal may still be present. Below is a picture featuring the signal, occurring at iteration n = 22: it signals that b(21) = 3085.834 is in close vicinity of the root b = 3083. The X-axis represents the iteration number in the fixed-point algorithm. How close to a root you end up is determined by the size of the window used for the moving average. A large window increases your chances of finding the root, but the price to pay is less accuracy, possibly a weaker signal, and a lot more computations.
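Below is a bare-bones Python sketch of the iteration and of the signal detection just described. It is my own illustration: the parameters μ, ε, λ, the window size and the signal threshold are arbitrary, so the sketch will not necessarily reproduce the signal at iteration 22 reported above, and it may produce false positives or false negatives, as discussed.

```python
import numpy as np

a = 7919 * 3083
eps, lam, window, mu = 1e-4, 1.0, 21, 2.0          # illustrative parameters, not the article's

# Discretized, magnified, smoothed function (same construction as in the previous snippet)
grid = np.arange(2000.0, 4000.0, 0.1)
f = 2 - np.cos(2 * np.pi * grid) - np.cos(2 * np.pi * a / grid)
h = np.convolve(np.log(eps + lam * f), np.ones(window) / window, mode="same")

def g_smooth(b):
    """Evaluate the smoothed, magnified function at the nearest grid point."""
    i = int(round((b - grid[0]) / 0.1))
    return h[min(max(i, 0), len(grid) - 1)]

signals = []
b_prev = 2000.0
b_curr = b_prev + mu * g_smooth(b_prev)
for n in range(1, 2000):
    b_next = b_curr + mu * g_smooth(b_curr)        # fixed-point style iteration
    step_old, step_new = b_curr - b_prev, b_next - b_curr
    if abs(step_new) > 1e-12 and abs(step_old / step_new) > 20:
        signals.append((n, b_curr, step_old / step_new))
    b_prev, b_curr = b_curr, b_next
    if not (2000.0 <= b_curr <= 4000.0):           # the iterates left the search interval
        break

for n, b_val, r in signals[:10]:
    print("possible signal at iteration %d: b = %.3f (step ratio %.1f)" % (n, b_val, r))
```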

The closest method to mine in the literature is probably the discrete fixed-point algorithm, see here. To offer an analogy for physicists, it is a bit as if the "hole" were a black hole, attracting massive objects that progress around it for some time, then get absorbed, emitting a radio signal when that happens, as the only proof revealing the existence of the black hole. The closer ε is to zero, the deeper the hole, and "abyss" may be a better word to describe it. Another, possibly more realistic way to describe it, is a rocket flying into space, approaching a massive object (a giant planet, or the sun), and getting its straight path seriously deflected after the encounter. The rocket continues to fly indefinitely after the encounter (just like my iterations of the fixed-point algorithm continue indefinitely to nowhere after the encounter with the hole, though on many occasions they will oscillate for a long time around a root - the hole - before eventually moving away). But the path deviation produced by the massive planet indicates that the rocket got close to a massive object, or equivalently in our case, the signal indicates we got close to a root of the function of interest.

3. Details

All the details will be provided in the next articles in this series. To not miss them, you can subscribe to our newsletter, here. We will discuss the following:

  • Source code and potential applications (e.g. Brownian bridges)
  • How to smooth chaotic curves, and visualization issues (see our red curve in the second figure - we will discuss how it was created)
  • How to optimize the parameters in our method without overfitting
  • How to improve our algorithm
  • How we used only local optimization, without storing a large table of f or g values -- indeed, the method uses essentially no memory; yet if successful it finds a global minimum or a root (this is very useful if the target interval in which to find a minimum or root is very large, or if each call to f or g is very time consuming)
  • A discussion of how to generalize the modulus function to non-integer numbers, and of its properties for real arguments, not just integers.

Wednesday, February 2, 2022

Data Animation: Much Easier than you Think!

In this article, I explain how to easily turn data into videos. Data animations add value to your presentations. They also constitute a useful tool in exploratory data analysis. Of course, as with any data visualization, careful choice of what to put in your video, and how to format it, is critical. The technique described here barely requires any programming skills. You may be able to produce your first video in less than one hour of work, possibly with no coding at all. I also believe that data camps and machine learning courses should include this topic in their curriculum.

Examples include the convergence of a 2D series related to the Riemann Hypothesis, and fractal supervised classification. Full source code is included.
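The full article contains the actual source code; as a teaser, here is a generic Python recipe (my own minimal sketch, not the article's code) showing the basic mechanics: draw one frame per time step with matplotlib, and let FuncAnimation assemble the frames into a video or animated GIF. Here the "data" is just a rotating point cloud.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Toy data: a 2D point cloud that slowly rotates from one frame to the next
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 2))

fig, ax = plt.subplots()
scat = ax.scatter(points[:, 0], points[:, 1], s=5)
ax.set_xlim(-4, 4)
ax.set_ylim(-4, 4)

def update(frame):
    theta = 2 * np.pi * frame / 100
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    scat.set_offsets(points @ rot.T)      # redraw the cloud, rotated
    ax.set_title("frame %d" % frame)
    return scat,

anim = FuncAnimation(fig, update, frames=100, interval=50)
anim.save("animation.gif", writer="pillow", fps=20)   # or save as .mp4 if ffmpeg is installed
```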

Read full article here.