Saturday, February 5, 2022

Fascinating Facts About Complex Random Variables and the Riemann Hypothesis

Despite my long career in statistics and machine learning, both in academia and in industry, I had never heard of complex random variables until recently, when I stumbled upon them by chance while working on a number theory problem. Yet I learned that they are used in several applications, including signal processing, quadrature amplitude modulation, information theory and actuarial sciences. See here and here.

In this article, I provide a short overview of the topic, with an application to understanding why the Riemann hypothesis (arguably the most famous unsolved mathematical conjecture of all time) might be true, using probabilistic arguments. State-of-the-art, recent developments about this conjecture are discussed in a way that most machine learning professionals can understand. The style of my presentation is very compact, with numerous references provided as needed. It is my hope that this will broaden the horizon of the reader, adding new modeling tools to her arsenal, and offering some off-the-beaten-path reading. The level of mathematics is rather simple, and you need to know very little (if anything) about complex numbers. After all, these random variables can be understood as bivariate vectors (X, Y), with X representing the real part and Y the imaginary part. They are typically denoted as Z = X + iY, where the complex number i (whose square is equal to -1) is the imaginary unit. There are some subtle differences with bivariate real variables, and the interested reader can find more details here. The complex Gaussian variable (see here) is of course the most popular case.

1. Illustration with damped complex random walks

Let (Zk) be an infinite sequence of independent and identically distributed random variables, with P(Zk = 1) = P(Zk = -1) = 1/2. We define the damped sequence as

Z(s) = Σ_{k≥1} Zk / k^s = Z1 / 1^s + Z2 / 2^s + Z3 / 3^s + ...

The originality here is that s = σ + it is a complex number. The above series clearly converges if the real part of s (the real number σ) is strictly above 1. The computation of the variance (first for the real part of Z(s), then for the imaginary part, then the full variance) yields:

Var[Re Z(s)] = Σ_{k≥1} cos²(t log k) / k^(2σ),   Var[Im Z(s)] = Σ_{k≥1} sin²(t log k) / k^(2σ),   Var[Z(s)] = Σ_{k≥1} 1 / k^(2σ) = ζ(2σ).

Here ζ is the Riemann zeta function. See also here. So we are dealing with a Riemann zeta-type distribution; other examples of such distributions can be found in one of my previous articles, here. The core result is that the damped series not only converges if σ > 1, as announced earlier, but even if σ > 1/2 when you look at the variance: σ > 1/2 keeps the variance of the infinite sum Z(s) finite. This result, due to the fact that we are manipulating complex rather than real numbers, will be of crucial importance in the next section, focusing on an application.
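
As a quick sanity check of the variance claim, here is a minimal Python simulation (my own sketch, not code from the original article): it truncates the series at N terms, generates many copies of Z(s), and compares the empirical variance with ζ(2σ).

import numpy as np
from mpmath import zeta

def simulate_Z(sigma, t, N=10000, reps=2000, seed=1):
    # Monte Carlo simulation of Z(s) = sum_{k>=1} Z_k / k^s, truncated at N terms
    rng = np.random.default_rng(seed)
    k = np.arange(1, N + 1)
    weights = k**(-sigma) * np.exp(-1j * t * np.log(k))   # k^(-s) with s = sigma + i*t
    Zk = rng.choice([-1.0, 1.0], size=(reps, N))          # the +/-1 sequence (Zk)
    return Zk @ weights                                   # one complex value of Z(s) per replication

sigma, t = 0.75, 2.0
Z = simulate_Z(sigma, t)
print("empirical variance:", np.var(Z.real) + np.var(Z.imag))   # variance of a complex random variable
print("zeta(2*sigma):     ", float(zeta(2 * sigma)))

With σ = 0.75, both numbers should land close to ζ(1.5) ≈ 2.61, even though σ is below 1: this is the σ > 1/2 phenomenon at work.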

It is possible to plot the distribution of Z(s), depending on the complex parameter s (or equivalently, depending on the two real parameters σ and t), using simulations. You can also compute its distribution numerically, using the inverse Fourier transform of its characteristic function. The characteristic function, computed for a real argument τ, is given by the following surprising product:

1.1. Smoothed random walks and distribution of runs

This sub-section is useful for the application discussed in section 2, and also for its own sake. If you don't have much time, you can skip it, and come back to it later.

When s = 0, the sum of the first n terms of the series defining Z(s) is the classic random walk (with n representing time): it has zero mean and variance equal to n (thus growing indefinitely with n); it can take on positive or negative values, and can stay positive (or negative) for a very long time, though it will eventually oscillate infinitely many times between positive and negative values (see here). We define the smoothed version Z*(s) as follows:

Z*(s) = Σ_{k≥1} Z*k / k^s,   where Z*k = (Zk + (-1)^(k+1)) / 2.

A run of length m is defined as a maximal subsequence Zk+1, ..., Zk+m whose terms all have the same sign: that is, m consecutive values all equal to +1, or all equal to -1. The probability for a run to be of length m > 0 in the original sequence (Zk) is equal to 1 / 2^m (here 2^m means 2 to the power m). In the smoothed sequence (Z*k), after removing the zeroes, that probability becomes 2 / 3^m. While by construction the Zk's are independent, note that the Z*k's are no longer independent. After removing all the zeroes (representing 50% of the Z*k's), the runs in the sequence (Z*k) tend to be much shorter than those in (Zk). This implies that the associated random walk (now actually less random) based on the Z*k's is better controlled, and can't go up and up (or down and down) for as long as the original random walk based on the Zk's. A classic result, known as the law of the iterated logarithm, states that

lim sup_{n→∞} (Z1 + Z2 + ... + Zn) / √(2n log log n) = 1

almost surely (that is, with probability 1). The definition of "lim sup" can be found here. Of course, this is no longer true for the sequence (Z*k) even after removing the zeroes.
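
The run-length probabilities 1/2^m and 2/3^m are easy to check by simulation. Below is a short Python sketch (my own code; the smoothing Z*k = (Zk + (-1)^(k+1)) / 2, which creates the zeroes, follows the definition given above):

import numpy as np
from collections import Counter

def run_lengths(x):
    # lengths of the maximal runs of identical values in a 1-D array of +1's and -1's
    changes = np.nonzero(np.diff(x))[0]
    edges = np.concatenate(([-1], changes, [len(x) - 1]))
    return np.diff(edges)

rng = np.random.default_rng(0)
n = 10**6
Z = rng.choice([-1, 1], size=n)
Zstar = (Z + (-1)**(np.arange(1, n + 1) + 1)) // 2    # smoothed sequence, values in {-1, 0, +1}
Zstar = Zstar[Zstar != 0]                             # remove the zeroes (about 50% of the terms)

for name, seq, predicted in [("(Zk)", Z, lambda m: 1 / 2**m), ("(Z*k)", Zstar, lambda m: 2 / 3**m)]:
    freq = Counter(run_lengths(seq))
    total = sum(freq.values())
    print(name)
    for m in range(1, 6):
        print("  run length %d: observed %.4f, predicted %.4f" % (m, freq[m] / total, predicted(m)))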

2. Application: heuristic proof of the Riemann hypothesis

The Riemann hypothesis, one of the most famous unsolved mathematical problems, is discussed here, and in the DSC article entitled Will Big Data Solve the Riemann Hypothesis. We approach this problem using a function L(s) that behaves (to some extent) like the Z(s) defined in section 1. We start with the following definitions:

where

  • Ω(k) is the prime omega function, counting the number of primes (including multiplicity) dividing k,
  • λ(k) = (-1)^Ω(k) is the Liouville function,
  • p1, p2, and so on (with p1 = 2) are the prime numbers.

Note that L(s, 1) = ζ(s) is the Riemann zeta function, and L(s) = ζ(2s) / ζ(s). Again, s = σ + it is a complex number. We also define Ln = Ln(0) and ρ = L(0, 1/2). We have L(1) = 0. The series for L(s) converges for sure if σ  >  1.
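
For readers who want to experiment, here is a small, self-contained Python sketch (my own code, not from the article): it computes λ(k) with a smallest-prime-factor sieve, checks that the partial sums Ln stay non-positive on the tested range, and compares a truncated version of the series for L(s) with ζ(2s) / ζ(s) for a value of s in the safe convergence zone σ > 1.

import numpy as np
from mpmath import zeta

def liouville(N):
    # lambda(k) = (-1)^Omega(k) for k = 1..N, via a smallest-prime-factor sieve
    spf = list(range(N + 1))
    for i in range(2, int(N**0.5) + 1):
        if spf[i] == i:                        # i is prime
            for j in range(i * i, N + 1, i):
                if spf[j] == j:
                    spf[j] = i
    lam = [0, 1] + [0] * (N - 1)
    for k in range(2, N + 1):
        lam[k] = -lam[k // spf[k]]             # Omega(k) = Omega(k / spf(k)) + 1
    return np.array(lam[1:])                   # lam[0] is a dummy entry

N = 200000
lam = liouville(N)

L = np.cumsum(lam)                             # L_n = lambda(1) + ... + lambda(n)
print("max of L_n for 2 <= n <= %d: %d" % (N, L[1:].max()))   # stays <= 0: the negative bias discussed in section 2.2

s = 1.5                                        # sigma > 1: the series is known to converge there
k = np.arange(1, N + 1)
print("truncated series:", np.sum(lam / k**s))
print("zeta(2s)/zeta(s):", float(zeta(2 * s) / zeta(s)))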

2.1. How to prove the Riemann hypothesis?

Any of the following conjectures, if proven, would make the Riemann hypothesis true:

  • The series for L(s) also converges if σ > 1/2: this is what we investigate in section 2.2. If it were to converge only for σ larger than (say) σ0 = 0.65, it would only establish that ζ(s) has no zero in the strip σ0 < σ < 1, leaving the range 1/2 < σ ≤ σ0 of the critical strip unsettled. It would still be a major victory, allowing us to get much more precise estimates about the distribution of prime numbers than currently known today. RH is equivalent to the fact that ζ(s) has no zero if 1/2 < σ < 1.
  • The number ρ is a normal number in base 2 (this would prove the much stronger Chowla conjecture, see here)
  • The sequence (λ(k)) is ergodic (this would also prove the much stronger Chowla conjecture, see here)
  • The sequence x(n+1) = 2x(n) - INT(2x(n)), with x(0) = (1 + ρ) / 2, is ergodic. This is equivalent to the previous statement. Here INT stands for the integer part function, and the x(n)'s are iterates of the Bernoulli map, one of the simplest chaotic discrete dynamical systems (see Update 2 in this post), with its main invariant distribution being uniform on [0, 1].
  • The function 1 / L(s) = ζ(s) / ζ(2s) has no zero if 1/2  <  σ  <  1
  • The numbers λ(k) behave in a way that is random enough so that, for any ε > 0, we have Ln / n^(1/2 + ε) → 0 as n → ∞; in other words, Ln = O(n^(1/2 + ε)) (see here).

Note that the last statement is weaker than the law of the iterated logarithm mentioned in section 1.1. The coefficient λ(k) plays the same role as Zk in section 1; however, because λ(mn) = λ(m)λ(n), the λ(k)'s can't be independent, not even asymptotically independent, unlike the Zk's. Clearly, the sequence (λ(k)) has weak dependencies. That in itself does not prevent the law of the iterated logarithm from applying (see examples here), nor does it prevent ρ from being a normal number (see why here). But it is conjectured that the law of the iterated logarithm does not apply to the sequence (λ(k)), due to another conjecture by Gonek (see here).
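
The growth condition in the last bullet point, and the randomness of the sign sequence of λ(k), can both be explored numerically. Here is a small sketch (my own illustration; ε = 0.05 and the cutoff N are arbitrary choices). It monitors |Ln| / n^(1/2 + ε) and looks at the frequencies of 2-digit blocks in the 0/1 sequence dk = (1 + λ(k)) / 2 that encodes the signs of λ(k), presumably the digits driving ρ and the Bernoulli-map orbit x(n).

import numpy as np

def liouville(N):
    # lambda(k) = (-1)^Omega(k) for k = 1..N (same sieve as in the previous snippet)
    spf = list(range(N + 1))
    for i in range(2, int(N**0.5) + 1):
        if spf[i] == i:
            for j in range(i * i, N + 1, i):
                if spf[j] == j:
                    spf[j] = i
    lam = [0, 1] + [0] * (N - 1)
    for k in range(2, N + 1):
        lam[k] = -lam[k // spf[k]]
    return np.array(lam[1:])

N = 500000
lam = liouville(N)
L = np.cumsum(lam)
n = np.arange(1, N + 1)

eps = 0.05                                       # arbitrary illustration value
# the bullet-point condition requires this ratio to tend to 0 as n grows
print("max of |L_n| / n^(1/2 + eps):", (np.abs(L) / n**(0.5 + eps)).max())

# 0/1 digits encoding the signs of lambda(k); under base-2 normality, every 2-digit block
# would appear with limiting frequency 1/4
d = (1 + lam) // 2
blocks = d[:-1] * 2 + d[1:]
print("2-digit block frequencies:", np.bincount(blocks) / len(blocks))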

2.2. Probabilistic arguments in favor of the Riemann hypothesis

The deterministic sequence (λ(k)), consisting of +1's and -1's in a 50/50 ratio, appears to behave rather randomly (if you look at its limiting empirical distribution), just like the sequence (Zk) in section 1 behaves perfectly randomly. Thus, one might think that the series defining L(s) also converges for σ > 1/2, not just for σ > 1. The reason this could be true is that the same thing happens to Z(s) in section 1, for the same reason. And if it is true, then the Riemann hypothesis is true, because of the first statement in the bullet list in section 2.1. Remember, s = σ + it; in other words, σ is the real part of the complex number s.

However, there is a big caveat that could perhaps be addressed to make the argument more convincing; this is the purpose of this section. As noted at the bottom of section 2.1, the sequence (λ(k)), even though it passes all the randomness tests that I have tried, is much less random than it appears to be. It obviously has weak dependencies, since the function λ is multiplicative: λ(mn) = λ(m)λ(n). This is related to the fact that prime numbers are not perfectly randomly distributed. Another disturbing fact is that Ln, the equivalent of the random walk defined in section 1, seems biased towards negative values. For instance, except for n = 1, it remains negative until n = 906,150,257, where it becomes positive for the first time (a fact established in 1980, thus disproving Polya's conjecture, see here). One way to address this is to work with Rademacher multiplicative random functions instead of (Zk) in section 1: see here for an example that would make the last item in the bullet list in section 2.1 true. Or see here for an example that preserves the law of the iterated logarithm (which itself would also imply the Riemann Hypothesis).

Finally, working with a smoothed version of L(s) or Ln, using the smoothing technique described in section 1.1, may lead to results that are easier to obtain, and might bring new insights into the original series L(s). The smoothed version L*(s) is defined, using the same technique as in section 1.1, as

L*(s) = Σ_{k≥1} (λ(k) + (-1)^(k+1)) / (2 k^s) = (η(s) + L(s)) / 2.

The function η(s) is the Dirichlet eta function, and L*(s) can be computed in Mathematica using (DirichletEta[s] + Zeta[2s] / Zeta[s]) / 2. Mathematica uses the analytic continuation of the ζ function if σ < 1. For instance, see the computation of L*(0.7) = -0.237771..., here. A table of the first million values of the Liouville function λ(k) can be produced in Mathematica in just a few seconds, using the command Table[LiouvilleLambda[n], {n, 1, 1000000}]. For convenience, I stored them in a text file, here. It would be interesting to see how good (or bad) these values are as the source of a pseudorandom number generator.
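
Here is a Python rendering of the same computations, using mpmath and sympy (my own translation of the Mathematica commands above; the file name and the cutoff at 1,000 values are arbitrary):

from mpmath import mp, zeta, altzeta
from sympy import factorint

mp.dps = 20                                   # working precision, in digits

def L_star(s):
    # smoothed series L*(s) = (eta(s) + zeta(2s)/zeta(s)) / 2; mpmath handles the
    # analytic continuation of zeta when sigma < 1, just like Mathematica does
    return (altzeta(s) + zeta(2 * s) / zeta(s)) / 2

print(L_star(0.7))                            # should match the value -0.237771... quoted above

def liouville_lambda(n):
    # Python analogue of Mathematica's LiouvilleLambda[n]
    return (-1) ** sum(factorint(n).values())

# analogue of Table[LiouvilleLambda[n], {n, 1, 1000000}], kept short here for speed
with open("liouville_lambda.txt", "w") as f:
    f.write(",".join(str(liouville_lambda(n)) for n in range(1, 1001)))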

New Machine Learning Optimization Technique - Part 1

In this series, we discuss a technique to find either the minima or the roots of a chaotic, unsmooth, or discrete function. A root-finding technique that works well for continuous, differentiable functions is successfully adapted and applied to piecewise constant functions with an infinite number of discontinuities. It even works if the function has no root: it will then find minima instead. In order to work, some constraints must be put on the parameters used in the algorithm, while avoiding over-fitting at the same time; this would be true anyway for the smooth, continuous, differentiable case. The method does not work in the classical sense where an iterative algorithm converges to a solution. Here the iterative algorithm always diverges, yet it has a clear stopping rule that tells us when we are very close to a solution, and what to do to find the exact solution. This is the originality of the method, and why I call it new. Our technique, like many machine learning techniques, can generate false positives or false negatives, and one aspect of our methodology is to minimize this problem.

Applications are discussed, as well as a full implementation with results, both for a real-life, challenging case and on simulated data. This first article in the series is a detailed introduction. Interestingly, we also show an example where a continuous, differentiable function with a very large number of wild oscillations benefits from being transformed into a non-continuous, non-differentiable function, on which our non-standard technique finds roots (or minima) while the standard techniques fail. For the moment, we limit ourselves to the one-dimensional case.

1. Example of problem

In the following, a = 7919 x 3083 is a constant, and b is the variable. We try to find the two roots b = 7919 and b = 3083 (both prime numbers) of the function f(b) = 2 - cos(2πb) - cos(2πa/b). We will just look between b = 2000 and b = 4000, to find the root b = 3083. This function is plotted below, between b = 2900 and b = 3200. The X-axis represents b, the Y-axis represents f(b).

Despite appearances, this function is well behaved, smooth, continuous and differentiable everywhere between b = 2000 and b = 4000: this would become apparent if you could zoom in a few times on the picture. Yet, it is no surprise that classical root-finding or minimum-finding algorithms such as Newton-Raphson (see here) fail, require a very large number of iterations to converge, or require starting very close to the unknown root, and are thus of no practical value. For instance, Mathematica can't find a global minimum even if you narrow down the search to 3000 < b < 3200, see here.
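
To reproduce the setting, here is a minimal Python sketch (my own code, not the implementation discussed in this series) that defines f, confirms the root at b = 3083, and shows how violently the function oscillates around it:

import numpy as np

a = 7919 * 3083

def f(b):
    return 2 - np.cos(2 * np.pi * b) - np.cos(2 * np.pi * a / b)

print("f(3083) =", f(3083.0))      # essentially 0 (up to floating-point error), since 3083 divides a

# Near b = 3083 the second cosine completes a full oscillation roughly every b^2/a (about 0.39
# units of b), while the first has period 1; the interference is what makes the plot look erratic.
b = np.linspace(3082.5, 3083.5, 100001)
print("min of f on [3082.5, 3083.5]:", f(b).min())
print("max of f on [3082.5, 3083.5]:", f(b).max())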

In this example, clearly f(b) = 0 in the interval 2000 < b < 4000 (and this is also the minimum possible value) if and only if b divides a. In order to solve this problem, we transformed f into a new function g which, despite being unsmooth, leads to a much faster algorithm. The new function g, as well as its smoothed version h, are pictured below (g is in blue, h is in red). In this case, our method solves the factoring problem (factoring the number a) in relatively few iterations, potentially much faster than sequentially identifying and trying, as candidate divisors, all 138 primes between 2000 and 3083.

However, this is far from being the best factoring algorithm, as it was not designed specifically for that purpose, but rather as a general-purpose method. In a subsequent article in this series, we apply the methodology to data that behaves somewhat like this example, but with random numbers: in that case, it is impossible to "guess" what the roots are, yet the algorithm is just as efficient.

2. Fundamentals of our new algorithm

This section outlines the main features of the algorithm. First, you want to magnify the effect of the root. In the first figure, roots (or minima) are invisible to the naked eye, or at least indistinguishable from many other values that are very close to zero. To achieve this goal (assuming f is positive everywhere), replace a suitably discretized version of f(x) by g(x) = log(ε + λf(x)), with ε > 0 close to zero. Then, in order to enlarge the width of the "hole" created around a root (in this case around b = 3083), you use some kind of moving average, possibly followed by a shift on the Y-axis.

The algorithm then proceeds as fixed-point iterations: bn+1 = bn + μ g(bn). Here we started with b0 = 2000. Rescaling is optional, if you want to keep the iterates bounded; one rescaling that does the trick here is bn+1 = bn + μ g(bn) / SQRT(bn). Assuming the iterations approach the root (or minimum) from the right direction, once they hit the "hole", the algorithm emits a signal, but then continues without ever ending, without ever converging. You stop when you see the signal, or after a fixed number of iterations if no signal ever shows up. In the latter case, you just missed the root (the equivalent of a false negative).

The signal is measured as the ratio (bn - bn-1) / (bn+1 - bn), usually close to one, but which spikes dramatically just after entering the hole, depending on the parameters. In some cases, the signal may be weaker, absent, or multiple, and this can result in false positives. In some cases, the second signal (when present) corresponded to the second root. Even if there is no root but a minimum instead, as in the above figure, the signal may still be present. Below is a picture featuring the signal, occurring at iteration n = 22: it indicates that b21 = 3085.834 is in the close vicinity of the root b = 3083. The X-axis represents the iteration number in the fixed-point algorithm. How close to a root you end up is determined by the size of the window used for the moving average. A large window increases your chances of finding the root, but the price to pay is less accuracy, possibly a weaker signal, and a lot more computations.
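
Here is a rough end-to-end sketch of the procedure just described, in Python. All constants (ε, λ, μ, the grid step, the window width, the Y-shift) are my own illustrative guesses, not the settings used in this series - choosing them well, without overfitting, is the topic of the next articles - so a given run may produce a false negative or a false positive, as discussed above.

import numpy as np

a = 7919 * 3083

def f(b):
    return 2 - np.cos(2 * np.pi * b) - np.cos(2 * np.pi * a / b)

eps  = 1e-9       # depth of the "hole" carved around a root
lam  = 1.0        # scaling of f inside the logarithm
mu   = 400.0      # step size of the fixed-point iteration
h    = 0.001      # discretization step for b
half = 3.0        # half-width of the moving-average window, in units of b
offsets = h * np.arange(-int(half / h), int(half / h) + 1)

def smoothed_log_f(b):
    # moving average of log(eps + lam * f) over a discretized window centered at b
    grid = np.round(b / h) * h + offsets
    return np.mean(np.log(eps + lam * f(grid)))

# Y-shift: remove most of the typical level of the smoothed curve, so that the drift stays
# small and positive away from a root and dips (or turns negative) inside the hole.
shift = 0.9 * np.mean([smoothed_log_f(b) for b in np.linspace(2000, 4000, 41)])

def g(b):
    return smoothed_log_f(b) - shift

traj = [2000.0]                                   # b_0
for n in range(120):
    traj.append(traj[-1] + mu * g(traj[-1]))      # b_{n+1} = b_n + mu * g(b_n): it never converges
traj = np.array(traj)

# The signal (b_n - b_{n-1}) / (b_{n+1} - b_n): normally close to 1; a spike or a sign flip
# suggests the iterates just crossed the hole around a root (here, near b = 3083).
inc = np.diff(traj)
signal = inc[:-1] / inc[1:]
for n in range(1, len(signal) + 1):
    print("n = %3d   b_n = %10.3f   signal = %10.3f" % (n, traj[n], signal[n - 1]))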

The closest to my methodology, in the literature, is probably the discrete fixed-point algorithm, see here. To provide an analogy for physicists, it is a bit as if the "hole" were a black hole, attracting massive objects that orbit around it for some time, then get absorbed, emitting a radio signal when that happens, as the only proof revealing the existence of the black hole. The closer to zero ε is, the deeper the hole, and "abyss" may be a better word to describe it. Another, possibly more realistic way to describe it, is a rocket flying into space, approaching a massive object (a giant planet, or the sun), and getting its straight path seriously deflected after the encounter. The rocket continues to fly indefinitely after the encounter, just like my iterations of the fixed-point algorithm continue indefinitely to nowhere after the encounter with the hole, though on many occasions they will oscillate for a long time around a root - the hole - before eventually moving away. But the path deviation produced by the massive planet indicates that the rocket got close to a massive object, or equivalently in our case, the signal indicates that we got close to a root of the function of interest.

3. Details

All the details will be provided in the next articles in this series. To not miss them, you can subscribe to our newsletter, here. We will discuss the following:

  • Source code and potential applications (e.g. Brownian bridges)
  • How to smooth chaotic curves, and visualization issues (see our red curve in the second figure - we will discuss how it was created)
  • How to optimize the parameters in our method without overfitting
  • How to improve our algorithm
  • How we used only local optimization without  storing a large table of f or g values -- indeed it does not use any memory; yet if successful it finds a global minimum or a root (this is very useful if your target interval to find a minimum or root is very large, or if each call to f or g is very time consuming)
  • A discussion on how to generalize the modulus function to non-integer numbers, and an investigation of its properties for real numbers, not just integers.

Wednesday, February 2, 2022

Data Animation: Much Easier than you Think!

In this article, I explain how to easily turn data into videos. Data animations add value to your presentations. They also constitute a useful tool in exploratory data analysis. Of course, as in any data visualization, careful choice of what to put in your video, and how to format it, is critical. The technique described here barely requires any programming skills. You may be able to produce your first video in less than one hour of work, and even with no coding at all. I also believe that data camps and machine learning courses should include this topic in their curriculum.
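
To give a flavor, here is one minimal way to do it in Python with matplotlib (my own toy example - the points of a 2-D random walk revealed frame by frame - and not the code or toolchain used in the full article):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter

rng = np.random.default_rng(0)
points = rng.standard_normal((300, 2)).cumsum(axis=0)   # a 2-D random walk: the "data"

fig, ax = plt.subplots(figsize=(5, 5))
ax.set_xlim(points[:, 0].min() - 1, points[:, 0].max() + 1)
ax.set_ylim(points[:, 1].min() - 1, points[:, 1].max() + 1)
scat = ax.scatter([], [], s=10)

def update(frame):
    scat.set_offsets(points[:frame + 1])                 # reveal one more point per frame
    ax.set_title("frame %d" % (frame + 1))
    return scat,

anim = FuncAnimation(fig, update, frames=len(points), interval=40)
anim.save("data_animation.gif", writer=PillowWriter(fps=25))   # needs the pillow package

Swap the random walk for your own data (one scatter plot, map, or bar chart per time step) and you have a data video; an MP4 can be produced the same way with writer="ffmpeg" if ffmpeg is installed.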

Examples include convergence of a 2D series related to the Riemann Hypothesis, and fractal supervised classification (see below). Full source code included.

Read full article here.

Monday, January 17, 2022

Neat R Plots with the Cairo Graphics Library

As I spent much of my career in business intelligence and marketing, programming in R was never one of my top skills. While I started programming in R more than 30 years ago, it was never one of my core activities. Yet I have a background in machine learning and image processing, so I am definitely used to producing high quality graphics; even in Excel, I gained a lot of experience producing nice plots.


Even though R (especially decades ago) was used mainly to produce all sorts of plots, I have always found the images produced by R to be of poor quality, but I was too busy with other things to really care about it. Indeed, millions of plots have been generated by millions of programmers since R's beginning (or that of its sister language, S+), and the majority of these plots are ugly, including some produced today.

I did some research to make sure the fix proposed in this article is not an obscure trick that only people with rusty programming skills would use. It turns out that my fix is widely used. It is based on the Cairo graphics library, which can be used in a variety of programming languages including Python, not just R. Its main feature (the one that attracted me) is its anti-aliasing capability when creating shapes such as lines or circles. Without anti-aliasing, lines (for instance) appear as jagged, broken segments, which looks low resolution and pretty ugly: this is still the way it works today when displaying images (on your screen) produced by a command such as plot, using the default R environment.

Before digging into the details, let me show you the contrast between an image with anti-aliasing, and one lacking it.

Read the full article, with solution (just two lines of code!) and fixed version of the above picture, here.

Thursday, December 9, 2021

Poisson-binomial Stochastic Processes: Introduction, Simulations, Inference

In simple English, we introduce a special, off-the-beaten-path type of point process called Poisson-binomial. We analyze its properties and perform simulations to see the distribution of points that it generates, in one and two dimensions, as well as to make inference about its parameters. Statistics of interest include distances to nearest neighbors and interarrival times (sometimes called increments). Combined with radial processes, it can be used to model complex systems involving clustering, for instance the distribution of matter in the universe. Limiting cases are Poisson processes, and this may lead to what is probably the simplest definition of a stochastic, stationary Poisson point process. Our approach is not based on traditional statistical science, but instead on modern machine learning methodology. Some of the tests performed reveal strong patterns invisible to the naked eye.

Probably the biggest takeaway from this article is its tutorial value. It shows how to handle a machine learning problem involving the construction of a new model, from start to finish, with all the steps described at a high level, accessible to professionals with one year's worth of academic training in quantitative fields. Numerous references are provided to help the interested reader dive into the technicalities. This article covers a large number of concepts in a compact style: for instance, stationarity, isotropy, intensity function, paired and unpaired data, order statistics, hidden processes, interarrival times, point count distribution, model fitting, model-free confidence intervals, simulations, radial processes, renewal processes, nearest neighbors, model identifiability, compound or hierarchical processes, Mahalanobis distance, rescaling, and more. It can be used as a general introduction to point processes.

Related probability distributions include logistic, Laplace, uniform, Cauchy, Poisson, Erlang, Poisson-binomial (the distribution these processes are named after), and many that don't have a name. These distributions are used in this article. Part 1 of this article can be found here. The link below points to part 2.

The accompanying spreadsheet has its own tutorial value, as it uses powerful Excel functions that are overlooked by many. Source code is also provided and included in the spreadsheet.


Read the full article here.



Content

1. Unpaired two-dimensional processes

2. Cluster and hierarchical point processes

  • Radial process
  • Basic cluster process
  • The spreadsheet
3. Machine learning inference
  • Interarrival time, and estimation of the intensity
  • Estimation of the scaling factor
  • Confidence intervals, model fitting, and identifiability issues
  • Estimation in higher dimensions
  • Results from some testing, including about radiality
  • Distributions related to the inverse or hidden process
  • Convergence to the Poisson process

Wednesday, December 1, 2021

A Gentle, Original Approach to Stochastic Point Processes

In this article, original stochastic processes are introduced. They may constitute one of the simplest examples and definitions of point processes. A limiting case is the standard Poisson process - the mother of all point processes - used in so many spatial statistics applications. We first start with the one-dimensional case, before moving to 2-D. Little probability theory knowledge is required to understand the content. In particular, we avoid discussing measure theory, Palm distributions, and other hard-to-understand concepts that would deter many users. Nevertheless, we dive pretty deep into the details, using simple English rather than arcane abstractions, to show the potential. A spreadsheet with simulations is provided, and model-free statistical inference techniques are discussed, including model fitting and tests for radial distributions. It is shown that two realizations of very different processes can look almost identical to the naked eye, while they actually have fundamental differences that can only be detected with machine learning techniques. Cluster processes are also presented in a very intuitive way. Various probability distributions, including logistic, Poisson-binomial and Erlang, are attached to these processes, for instance to model the distribution and/or number of points in some area, or distances to nearest neighbors.



Read full article here

Saturday, September 4, 2021

Machine Learning Perspective on the Twin Prime Conjecture

This article focuses on the machine learning aspects of the problem, and the use of pattern recognition techniques leading to interesting, new findings about twin primes. Twin primes are prime numbers p such that p + 2 is also prime. For instance, 3 and 5, or 29 and 31. A famous, unsolved and old mathematical conjecture states that there are infinitely many such primes, but a proof still remains elusive to this day. Twin primes are far rarer than primes: there are infinitely more primes than there are twin primes, in the same way that there are infinitely more integers than there are prime integers.
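
To fix ideas, here is a tiny Python snippet (my own illustration, not part of the experiments described below) listing the twin prime pairs below 200 and contrasting their count with the number of primes:

from sympy import isprime

twin_pairs = [(p, p + 2) for p in range(2, 200) if isprime(p) and isprime(p + 2)]
print(twin_pairs)
print("primes below 200:          ", sum(1 for n in range(2, 200) if isprime(n)))
print("twin prime pairs below 200:", len(twin_pairs))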


Here I discuss the results of my experimental math research, based on big data, algorithms, machine learning, and pattern discovery. The level is accessible to all machine learning practitioners. I first discuss my experiments in section 1, and then how they relate to the twin prime conjecture, in section 2. Mathematicians may be interested as well, as this leads to a potential new path to prove the conjecture. But machine learning readers with little time, who are not curious about the mathematical aspects, can read section 1 and skip section 2.

I do not prove the twin prime conjecture (yet). Rather, based on data analysis, I provide compelling evidence (the strongest I have ever seen), supporting the fact that it is very likely to be true. It is not based on heuristic or probabilistic arguments (unlike this version dating back to around 1920), but on hard counts and strong patterns.

This is not different from analyzing data and finding that smoking is strongly correlated with lung cancer: the relationship may not be causal, as there might be confounding factors. In order to prove causality, more than data analysis is needed (in the case of smoking, of course, causality was firmly established long ago).

Read full article here
