Tuesday, March 15, 2022

How to build simple, accurate, data-driven, model-free confidence intervals (2011)

 An updated version with source code and detailed explanations can be found here.

If observations from a specific experiment (for instance, scores computed on 10 million credit card transactions) are assigned a random bin ID (labeled 1, ··· ,k), then you can easily build a confidence interval for any proportion or score computed on these k random bins, using the Analyticridge theorem.

The proof of this theorem relies on complicated combinatorial arguments and the use of the Beta function. Note that the final result does not depend on the distribution associated with your data - in short, your data does not have to follow a Gaussian (a.k.a normal) or any prespecified statistical distribution, to make the confidence intervals valid. You can find more details regarding the proof of the theorem in the book Statistics of Extremes by E.J. Gumbel, pages 58-59 (Dover edition, 2004)

Parameters in the Analyticbridge theorem can be chosen to achieve the desired level of precision - e.g a 95%, 99% or 99.5% confidence interval. The theorem will also tell you what your sample size should be to achieve a pre-specified accuracy level. This theorem is a fundamental result to compute simple, per-segment, data-driven, model-free confidence intervals in many contexts, in particular when generating predictive scores produced via logistic / ridge regression or decision trees / hidden decision trees (e.g. for fraud detection, consumer or credit scoring).


A scoring system designed to detect customers likely to fail on a loan, is based on a rule set. On average, for an individual customer, the probability to fail is 5%. In a data set with 1 million observations (customers) and several metrics such as credit score, amount of debt, salary, etc. if we randomly select 99 bins each containing 1,000 customers, the 98% confidence interval (per bin of 1,000 customers) for the failure rate is (say) [4.41%, 5.53%], based on the  Analyticridge theorem, with k = 99 and m = 1 (read the theorem to understand what k and m mean - it's actually very easy to understand the signification of these parameters).

Now, looking at a non-random bin with 1,000 observations, consisting of customers with credit score < 650 and less than 26 years old, we see that the failure rate is 6.73%. We can thus conclude that the rule credit score < 650 and less than 26 years older is actually a good rule to detect failure rate, because 6.73% is well above the upper bound of the [4.41%, 5.53%] confidence interval.

Indeed, we could test hundreds of rules, and easily identify rules with high predictive power, by systematically and automatically looking at how far the observed failure rate (for a given rule) is from a standard confidence interval. This allows us to rule out effect of noise, and process and rank numerous rules (based on their predictive power - that is, how much their failure rate is above the confidence interval upper bound) at once.

Old Version (2009)

This is a simple technique. Let's say you want to estimate a parameter p (proportion, mean, it does not matter).

* Divide your observations into N random buckets
* Compute the estimated value for each bucket
* Rank these estimates, from p_1 (smallest value) to p_N (largest value)
* Let p_k be your confidence interval lower bound for p, with k less than N/2
* Let p_(N-k+1) be your confidence interval upper bound for p


* [p_k,p_(N-k+1)] is a non parametric confidence interval for p
* The confidence level is 2 k/(N+1)

Has anyone tried to use this formula? Or performed simulation (e.g. with a normal distribution) to double check that it is correct? Note that by trying multiple values for k, (and for N although this is more computer intensive), it is possible to interpolate confidence intervals of any levels.

Finally, you want to keep N as low as possible (and k=1 ideally) to achieve the desired confidence interval level. For instance, for a 90% confidence interval, N=19 and k=1 work and are optimum.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Fuzzy Regression: A Generic, Model-free, Math-free Machine Learning Technique

  A different way to do regression with prediction intervals. In Python and without math. No calculus, no matrix algebra, no statistical eng...