Monday, January 17, 2022

Neat R Plots with the Cairo Graphics Library

As I spent much of my career in business intelligence and marketing, programming in R was never one of my top skills. While I started programming in R more than 30 years ago, it was never one of my core activities either. Yet I have a background in machine learning and image processing, so I am used to producing high-quality graphics. Even in Excel, I gained a lot of experience producing polished plots.

Although R has long been used mainly to produce all sorts of plots (especially decades ago), I have always found the images produced by R to be of poor quality; I was simply too busy with other things to really care about it. Millions of plots have been generated by millions of programmers since R's beginnings (or those of its sister language, S+), yet the majority of them are ugly, including some produced today.

I did some research to make sure the fix proposed in this article is not an obscure trick that only people with rusty programming skills would use. It turns out that the fix is widely used. It is based on the Cairo graphics library, which is available in a variety of programming languages including Python, not just R. Its main feature (the one that attracted me) is its anti-aliasing capability when drawing shapes such as lines or circles. Without anti-aliasing, lines (for instance) appear as broken segments, and the result looks low-resolution and rather ugly: this is still what you get today when displaying images (on your screen) produced by a command such as plot in the default R environment.
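To see what anti-aliasing actually does, here is a toy Python sketch of the idea (my own illustration, not Cairo's actual algorithm, which is far more sophisticated): when a line crosses between two pixel rows, an anti-aliased rasterizer splits the ink between them in proportion to coverage, instead of rounding to a single row.

```python
def draw_line_rows(width, slope, antialias):
    """Rasterize the line y = slope * x across `width` columns.
    Returns one dict per column mapping pixel row -> ink intensity."""
    cols = []
    for x in range(width):
        y = slope * x
        base = int(y)
        frac = y - base
        if antialias:
            # split the ink between the two pixel rows the line crosses
            cols.append({base: 1 - frac, base + 1: frac})
        else:
            # all-or-nothing: round to the nearest pixel row
            cols.append({round(y): 1.0})
    return cols

jagged = draw_line_rows(20, 0.3, antialias=False)  # hard 0/1 pixels: staircase look
smooth = draw_line_rows(20, 0.3, antialias=True)   # fractional intensities: smooth look
```

The jagged version jumps abruptly from one row to the next, producing the staircase effect; the smooth version fades gradually, which is why anti-aliased lines look continuous at normal viewing distance.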

Before digging into the details, let me show you the contrast between an image with anti-aliasing and one lacking it.

Read the full article, with solution (just two lines of code!) and fixed version of the above picture, here.

Thursday, December 9, 2021

Poisson-binomial Stochastic Processes: Introduction, Simulations, Inference

In simple English, we introduce a special, off-the-beaten-path type of point process called Poisson-binomial. We analyze its properties and perform simulations to see the distribution of points that it generates, in one and two dimensions, as well as to make inferences about its parameters. Statistics of interest include distances to nearest neighbors and interarrival times (sometimes called increments). Combined with radial processes, it can be used to model complex systems involving clustering, for instance the distribution of matter in the universe. Limiting cases are Poisson processes, and this may lead to what is probably the simplest definition of a stochastic, stationary Poisson point process. Our approach is not based on traditional statistical science, but instead on modern machine learning methodology. Some of the tests performed reveal strong patterns invisible to the naked eye.

Probably the biggest takeaway from this article is its tutorial value. It shows how to handle a machine learning problem involving the construction of a new model, from start to finish, with all the steps described at a high level, accessible to professionals with one year's worth of academic training in quantitative fields. Numerous references are provided to help the interested reader dive into the technicalities. The article covers a large number of concepts in a compact style: for instance, stationarity, isotropy, intensity function, paired and unpaired data, order statistics, hidden processes, interarrival times, point count distribution, model fitting, model-free confidence intervals, simulations, radial processes, renewal processes, nearest neighbors, model identifiability, compound or hierarchical processes, Mahalanobis distance, rescaling, and more. It can be used as a general introduction to point processes.

Related probability distributions include logistic, Laplace, uniform, Cauchy, Poisson, Erlang, Poisson-binomial (the distribution these processes are named after), and many that don't have a name. These distributions are used in this article. Part 1 of this article can be found here. The link below points to part 2.

The accompanying spreadsheet has its own tutorial value, as it uses powerful Excel functions that are overlooked by many. Source code is also provided and included in the spreadsheet.
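To give a feel for the simulations, here is a minimal one-dimensional sketch in Python (my own illustration; the article's code lives in the spreadsheet and its exact parameterization may differ). I assume the simple instance where point k sits at (k + s · logistic noise)/λ; the interarrival times then give an estimate of the intensity λ:

```python
import math
import random

def simulate_pb_process(n, lam, s, seed=42):
    """Simulate a 1-D Poisson-binomial-style point process:
    point k is placed at (k + s * logistic noise) / lam."""
    rng = random.Random(seed)
    points = []
    for k in range(-n, n + 1):
        u = rng.random()
        noise = math.log(u / (1.0 - u))  # logistic quantile function
        points.append((k + s * noise) / lam)
    return sorted(points)

points = simulate_pb_process(n=1000, lam=2.0, s=0.5)
# Interarrival times (increments) between successive points
gaps = [b - a for a, b in zip(points, points[1:])]
est_intensity = 1.0 / (sum(gaps) / len(gaps))
print(round(est_intensity, 2))  # should be close to lam = 2.0
```

As the scaling factor s shrinks, the points align on the regular grid k/λ; as s grows, neighboring points get shuffled and the realization becomes harder to tell apart from a Poisson process.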

Read the full article here.


1. Unpaired two-dimensional processes

2. Cluster and hierarchical point processes

  • Radial process
  • Basic cluster process
  • The spreadsheet

3. Machine learning inference
  • Interarrival time, and estimation of the intensity
  • Estimation of the scaling factor
  • Confidence intervals, model fitting, and identifiability issues
  • Estimation in higher dimensions
  • Results from some testing, including about radiality
  • Distributions related to the inverse or hidden process
  • Convergence to the Poisson process

Wednesday, December 1, 2021

A Gentle, Original Approach to Stochastic Point Processes

In this article, original stochastic processes are introduced. They may constitute one of the simplest examples and definitions of point processes. A limiting case is the standard Poisson process - the mother of all point processes - used in so many spatial statistics applications. We first start with the one-dimensional case before moving to 2-D. Little probability theory knowledge is required to understand the content. In particular, we avoid discussing measure theory, Palm distributions, and other hard-to-understand concepts that would deter many users. Nevertheless, we dive fairly deep into the details, using simple English rather than arcane abstractions, to show the potential. A spreadsheet with simulations is provided, and model-free statistical inference techniques are discussed, including model fitting and a test for radial distribution. It is shown that two realizations of very different processes can look almost identical to the naked eye, while they actually have fundamental differences that can only be detected with machine learning techniques. Cluster processes are also presented in a very intuitive way. Various probability distributions, including logistic, Poisson-binomial and Erlang, are attached to these processes, for instance to model the distribution and/or number of points in some area, or distances to nearest neighbors.
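As an illustration of the nearest-neighbor statistics mentioned above, here is a small Python sketch (my own, not the article's code) that scatters points uniformly on the unit torus, mimicking a 2-D Poisson realization, and compares the mean nearest-neighbor distance with the theoretical value 1/(2√λ) for a homogeneous Poisson process of intensity λ:

```python
import math
import random

def nearest_neighbor_mean(n, seed=1):
    """Scatter n uniform points on the unit torus and return the mean
    distance to the nearest neighbor (torus metric avoids edge effects)."""
    rng = random.Random(seed)
    pts = [(rng.random(), rng.random()) for _ in range(n)]

    def torus_dist(a, b):
        dx = abs(a[0] - b[0]); dx = min(dx, 1 - dx)
        dy = abs(a[1] - b[1]); dy = min(dy, 1 - dy)
        return math.hypot(dx, dy)

    total = 0.0
    for i, p in enumerate(pts):
        total += min(torus_dist(p, q) for j, q in enumerate(pts) if j != i)
    return total / n

n = 500                            # intensity: n points per unit area
observed = nearest_neighbor_mean(n)
expected = 1 / (2 * math.sqrt(n))  # Poisson theory: E[distance] = 1/(2*sqrt(lambda))
print(round(observed, 4), round(expected, 4))
```

The brute-force O(n²) nearest-neighbor scan is fine at this size; for large n one would use a spatial index instead.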

Read full article here

Saturday, September 4, 2021

Machine Learning Perspective on the Twin Prime Conjecture

This article focuses on the machine learning aspects of the problem, and the use of pattern recognition techniques leading to interesting, new findings about twin primes. Twin primes are prime numbers p such that p + 2 is also prime; for instance, 3 and 5, or 29 and 31. A famous, old, and still unsolved mathematical conjecture states that there are infinitely many such primes, but a proof remains elusive to this day. Twin primes are far rarer than primes: there are infinitely more primes than twin primes, in the same way that there are infinitely more integers than prime integers.
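The counting behind such statements is easy to reproduce; a quick Python sketch (mine, for illustration):

```python
def is_prime(n):
    """Trial-division primality test, fine for small n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

def twin_pairs(limit):
    """Count pairs (p, p + 2), both prime, with p <= limit."""
    return sum(1 for p in range(2, limit + 1)
               if is_prime(p) and is_prime(p + 2))

print(twin_pairs(100))   # 8 pairs: (3,5), (5,7), (11,13), ..., (71,73)
print(twin_pairs(1000))  # 35 pairs, versus 168 primes below 1,000
```

The gap between 35 twin pairs and 168 primes below 1,000 gives a concrete sense of how much rarer twins are, and it only widens as the limit grows.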

Here I discuss the results of my experimental math research, based on big data, algorithms, machine learning, and pattern discovery. The level is accessible to all machine learning practitioners. I first discuss my experiments in section 1, and then how they relate to the twin prime conjecture in section 2. Mathematicians may be interested as well, as this leads to a potential new path to prove the conjecture. Machine learning readers with little time, and not curious about the mathematical aspects, can read section 1 and skip section 2.

I do not prove the twin prime conjecture (yet). Rather, based on data analysis, I provide compelling evidence (the strongest I have ever seen) that it is very likely to be true. It is not based on heuristic or probabilistic arguments (unlike this version dating back to around 1920), but on hard counts and strong patterns.

This is no different from analyzing data and finding that smoking is strongly correlated with lung cancer: the relationship may not be causal, as there might be confounding factors. To prove causality, more than data analysis is needed (in the case of smoking, of course, causality was firmly established long ago).

Read full article here

Thursday, May 13, 2021

Fascinating Facts About Complex Random Variables and the Riemann Hypothesis

Despite my long statistical and machine learning career, both in academia and in industry, I had never heard of complex random variables until recently, when I stumbled upon them by chance while working on a number theory problem. I have since learned that they are used in several applications, including signal processing, quadrature amplitude modulation, information theory, and actuarial sciences.

In this article, I provide a short overview of the topic, with an application to understanding why the Riemann hypothesis (arguably the most famous unsolved mathematical conjecture of all time) might be true, using probabilistic arguments. State-of-the-art, recent developments about this conjecture are discussed in a way that most machine learning professionals can understand. The style of my presentation is very compact, with numerous references provided as needed. It is my hope that this will broaden the horizon of the reader, offering new modeling tools for their arsenal, and an off-the-beaten-path read. The level of mathematics is rather simple, and you need to know very little (if anything) about complex numbers. After all, these random variables can be understood as bivariate vectors (X, Y), with X representing the real part and Y the imaginary part. They are typically denoted as Z = X + iY, where the complex number i (whose square is equal to -1) is the imaginary unit. There are some subtle differences with bivariate real variables. The complex Gaussian variable is of course the most popular case.
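Since a complex random variable is just a bivariate real one in disguise, it is easy to experiment with. A short Python sketch (my own illustration) simulates a standard complex Gaussian Z = X + iY and checks two basic facts: E[|Z|²] = Var(X) + Var(Y), while the pseudo-variance E[Z²] vanishes under circular symmetry:

```python
import random

random.seed(0)
n = 100_000
# Z = X + iY with X, Y independent standard Gaussians
zs = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

mean_abs2 = sum(abs(z) ** 2 for z in zs) / n  # estimates E[|Z|^2] = Var(X) + Var(Y) = 2
pseudo_var = sum(z * z for z in zs) / n       # estimates E[Z^2] = 0 (circular symmetry)
print(round(mean_abs2, 2))
print(abs(pseudo_var))
```

The vanishing pseudo-variance is one of the subtle differences mentioned above: for a general bivariate pair, E[Z²] = Var(X) - Var(Y) + 2i Cov(X, Y) need not be zero.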

Read full article here.

Sunday, August 30, 2020

The Exponential Mean: Alternative to Classic Means

Given n observations x1, ..., xn, the generalized mean (also called power mean) is defined as

M_p = [ (x1^p + x2^p + ... + xn^p) / n ]^(1/p)

The case p = 1 corresponds to the traditional arithmetic mean, while p = 0 (as a limiting case) yields the geometric mean, and p = -1 yields the harmonic mean. See here for details. This metric is favored by statisticians. It is a particular case of the quasi-arithmetic mean.

Here I introduce another kind of mean, called the exponential mean, also based on a parameter p, that may appeal to data scientists and machine learning professionals. It is also a special case of the quasi-arithmetic mean. Though the concept is basic, there is very little, if any, literature about it. It is related to LogSumExp and the Log semiring. It is defined as follows:

m_p = log_p [ (p^(x1) + p^(x2) + ... + p^(xn)) / n ]

Here the logarithm is in base p, with p positive. As p tends to 0, m_p tends to the minimum of the observations; as p tends to 1, it yields (in the limit) the classic arithmetic mean; and as p tends to infinity, it yields the maximum of the observations.
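The definition translates directly into Python (a sketch; the base-p logarithm is computed as ln(·)/ln p, so p = 1 itself is excluded and only approached as a limit):

```python
import math

def exp_mean(xs, p):
    """Exponential mean: log base p of the average of p**x."""
    assert p > 0 and p != 1
    return math.log(sum(p ** x for x in xs) / len(xs)) / math.log(p)

xs = [1.0, 2.0, 3.0]
print(exp_mean(xs, 1.0001))  # close to the arithmetic mean, 2
print(exp_mean(xs, 1e-6))    # approaches the minimum, 1
print(exp_mean(xs, 1e6))     # approaches the maximum, 3
```

For observations of large magnitude, p**x can overflow; in practice one rescales first (subtracting the maximum before exponentiating, as in the LogSumExp trick).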

Content of this article

  • Advantages of the exponential mean
  • Illustration on a test data set
  • Doubly exponential mean

Read the full article here. 

Friday, June 5, 2020

Bernoulli Lattice Models - Connection to Poisson Processes

Bernoulli lattice processes may be among the simplest examples of point processes, and can be used as an introduction to more complex spatial processes that rely on advanced measure theory for their definition. In this article, we show the differences and analogies between Bernoulli lattice processes on the standard rectangular or hexagonal grid and the Poisson process, including convergence of discrete lattice processes to the continuous Poisson process, mainly in two dimensions. We also illustrate that even though these lattice processes are purely random, they don't look random to the naked eye.
We discuss basic properties such as the distribution of the number of points in any given area, or the distribution of the distance to the nearest neighbor. Bernoulli lattice processes have been used as models in financial problems. Most of the papers on this topic are hard to read, but here we discuss the concepts in simple English. Interesting number theory problems about sums of squares, deeply related to these lattice processes, are also discussed. Finally, we show how to identify whether a particular realization comes from a Bernoulli lattice process, a Poisson process, or a combination of both.
Below is a realization of a Bernoulli process on the regular hexagonal lattice. The main feature of such a process is that the point locations are fixed, not random. But whether a point is "fired" or not (that is, marked in blue) is purely random and independent of whether any other point is fired. The probability for a point to be fired is a Bernoulli variable with parameter p.
Figure 1: realization of a Bernoulli hexagonal lattice process
More sophisticated models, known as Markov random fields, allow neighboring points to be correlated. They are useful in image analysis.
By contrast, Poisson processes assume that the point locations themselves are random. The fired points are uniformly distributed over the plane, not restricted to integer or grid coordinates. In short, Bernoulli lattice processes are discrete approximations to Poisson processes. Below is an example of a realization of a Poisson process.
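A minimal Python sketch of a Bernoulli lattice realization on a square grid (my illustration, not the article's code; the hexagonal case is analogous):

```python
import random

def bernoulli_lattice(size, p, seed=7):
    """Fire each site of a size x size square lattice independently
    with probability p; return the list of fired (x, y) sites."""
    rng = random.Random(seed)
    return [(x, y) for x in range(size) for y in range(size)
            if rng.random() < p]

fired = bernoulli_lattice(size=100, p=0.1)
# The point count is Binomial(size^2, p), close to Poisson(size^2 * p)
# when p is small: the expected count here is 1,000.
print(len(fired))
```

Shrinking the grid spacing to 1/m while letting the firing probability decay as λ/m² keeps the expected number of points per unit area equal to λ, which is the intuition behind the convergence to a Poisson process of intensity λ discussed above.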