Sunday, May 22, 2022

Fuzzy Regression: A Generic, Model-free, Math-free Machine Learning Technique

A different way to do regression with prediction intervals. In Python and without math. No calculus, no matrix algebra, no statistical engineering, no regression coefficients, no bootstrap. Multivariate and highly non-linear. Interpretable and illustrated on synthetic data. Read more here.

For years, I have developed machine learning techniques that barely use any mathematics. I view it as a sport. Not that I don’t know anything about mathematics, quite the contrary. I believe you must be very math-savvy to achieve such accomplishments. This article epitomizes math-free machine learning. It is the result of years of research. The highly non-linear methodology described here may not be easier to grasp than math-heavy techniques. It has its own tricks. Yet, you could, in principle, teach it to middle school students.

Fuzzy regression with prediction intervals, original version, 1D

I did not in any way compromise on the quality and efficiency of the technique, for the sake of gaining the “math-free” label. What I describe here is a high performing technique in its own right. You can use it to solve various problems: multivariate regression, interpolation, data compression, prediction, or spatial modeling (well, without “model”). It comes with prediction intervals. Yet there is no statistical or probability model behind it, no calculus, no matrix algebra, no regression coefficients, no bootstrapping, no resampling, not even square roots.

Read the full article, and access the full technical report, Python code and data sets (all, free, no sign-up required), from here.

Tuesday, May 17, 2022

New Book: Approaching (Almost) Any Machine Learning Problem

This self-published book is dated July 2020 according to Amazon. But it appears to be an ongoing project. Like many new books, the material is on GitHub. The most recent version, dated June 2021, is available in PDF format.

This is not a traditional book. It feels like a repository of Python code, printed on paper if you buy the print version. The associated GitHub repository is much more useful if you want to re-use the code with simple copy and paste. It covers a lot of topics and performance metrics, with emphasis on computer vision problems. The code is documented in detail. It represents 80% of the content, and the comments in the code should be considered an important, integral part of the content.

That said, the book is not an introduction to machine learning algorithms. It assumes some knowledge of the algorithms discussed, and there are no mathematical explanations. I find it to be an excellent 300-page Python tutorial covering many ML topics (maybe too many). The author focuses on real problems and real data. The style is very far from academic, and in my opinion, anti-academic.

Tuesday, May 10, 2022

How to Create/Use Great Synthetic Data for Interpretable Machine Learning

I share here my new article on synthetic data and interpretable machine learning. It will show you how to set up such an environment. I also mention three popular books published in the last three months. The figure below is from the first article featured in this newsletter.

Article: synthetic data and interpretable machine learning. This first article in a new series on synthetic data and explainable AI focuses on making linear regression more meaningful and controllable. It includes synthetic data, advanced machine learning with Excel, combinatorial feature selection, parametric bootstrap, cross-validation, and alternatives to R-squared to measure model performance. The full technical article (PDF, 13 pages, with detailed explanations and […]. Read more here.

New book: Interpretable Machine Learning. Subtitled “A Guide for Making Black Box Models Explainable”. Authored and self-published by Christoph Molnar, 2022 (319 pages). This is actually the second edition, the first one was published in 2019. According to Google Scholar, it was cited more than 2,500 times. So this is a popular book about a popular topic. General Comments The […]. Read my review here.

New book: Efficient Deep Learning. Subtitled “Fast, smaller, and better models”. This book goes through algorithms and techniques used by researchers and engineers at Google Research, Facebook AI Research (FAIR), and other eminent AI labs to train and deploy their models on devices ranging from large server-side machines to tiny microcontrollers. The book presents a balance of fundamentals as well […] Read more here.

New book: Probabilistic Machine Learning. By Kevin Murphy, MIT Press (2022). This is one of the best machine learning books that I purchased in the last few years. Very comprehensive, covering a lot of statistical science too. The level is never too high, despite a few advanced concepts being discussed. There is a lot of focus on applications, especially image […] Read my review here.

Browse the MLTechniques.com blog by category to find more content that is relevant to you. For instance, articles in the synthetic data category can be found here. The resources section, here, features detailed technical reports and other books, some available to subscribers only, some available to all.

Monday, April 25, 2022

Upcoming Books and Articles on MLTechniques.com

Here I share my roadmap for the next 12 months. While I am also looking for external contributors and authors to add more variety, my focus — as far as my technical content is concerned — is to complete the following projects and publish the material on this platform.

Summary

All my blog posts will be available to everyone. Some technical papers (in PDF format) may be offered to subscribers only (you can subscribe here). My plan is to also produce books focusing on specific topics, covering material from several articles in a self-contained unified package. They will be available on our e-Store.

Various themes will be covered, including synthetic data, new regression techniques, clustering and classification, data animations, sound, “no code” machine learning, explainable AI and very deep neural networks, a zoo of probability distributions, Excel for machine learning, experimental math, innovative machine learning, and off-the-beaten path exercises. Read the full article here.

On a different note, my promised article on shape recognition — part of this larger publishing project — is now live. You can find it here (see the section "Free Books and Articles" after following the link). Below is the abstract.

Abstract for the shape recognition article:

I define the mathematical concept of shape and shape signature in two dimensions, using parametric polar equations. The signature uniquely characterizes the shape, up to a translation or scale factor. In practical applications, the data set consists of points or pixels located on the shape, rather than the curve itself. If these points are not properly sampled - if they are not uniformly distributed on the curve - they need to be re-weighted to compute a meaningful centroid of the shape, and to perform shape comparisons. I discuss the weights, and then introduce metrics to compare shapes (observed as sets of points or pixels in an image). These metrics are related to the Hausdorff distance. I also introduce a correlation distance between two shapes. Equipped with these metrics, one can perform shape recognition or classification using training sets of arbitrary sizes. I use synthetic data in the applications. It allows you to see how the classifier performs, to discriminate between two very similar shapes, or in the presence of noise. Rotation-invariant metrics are also discussed.
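For readers unfamiliar with the Hausdorff distance mentioned in the abstract, here is a minimal sketch of the classical, unweighted version applied to two small sets of points. It is not the metric from the article, which is only described as related to it; the function names and the toy shapes are mine and purely illustrative.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point of `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Classical (symmetric) Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Toy example: the corners of a unit square, and the same corners shifted to the right.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
shifted = [(x + 0.5, y) for x, y in square]

print(hausdorff(square, shifted))
```

Note that this plain version is sensitive to translation, whereas the signature described in the abstract is not. That is one reason to treat it as background rather than as the method itself.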

Wednesday, March 16, 2022

Advanced Machine Learning with Basic Excel - Part 1

Learn advanced machine learning techniques using Excel. No coding required.

It is amazing what you can do with a simple tool such as Excel. In this series, I share some of my spreadsheets. They cover many topics, including multiple types of regression, model-free confidence intervals, resampling, an original technique known as hidden decision trees, scatter plots with multiple groups, advanced visualization techniques, and more.

No plug-in is required. I don't use macros, pivot tables or any advanced Excel feature. In part 1 (this article), I cover the following techniques:

Tuesday, March 15, 2022

Machine Learning Textbook: Stochastic Processes and Simulations (Volume 1)

Introduction

This scratch course on stochastic processes covers significantly more material than usually found in traditional books or classes. The approach is original: I introduce a new yet intuitive type of random structure called perturbed lattice or Poisson-binomial process, as the gateway to all the stochastic processes. Such models have started to gain considerable momentum recently, especially in sensor data, cellular networks, chemistry, physics and engineering applications. I present state-of-the-art material in simple words, in a compact style, including new research developments and open problems. I focus on the methodology and principles, providing the reader with solid foundations and numerous resources: theory, applications, illustrations, statistical inference, references, glossary, educational spreadsheet, source code, stochastic simulations, original exercises, videos and more.

Below is a short selection highlighting some of the topics featured in the textbook. Some are research results published here for the first time.

• GPU clustering: Fractal supervised clustering in GPU (graphics processing unit) using image filtering techniques akin to neural networks, automated black-box detection of the number of clusters, unsupervised clustering in GPU using density (gray levels) equalizer.
• Inference: New test of independence, spatial processes, model fitting, dual confidence regions, minimum contrast estimation, oscillating estimators, mixture and superimposed models, radial cluster processes, exponential-binomial distribution with infinitely many parameters, generalized logistic distribution.
• Nearest neighbors: Statistical distribution of distances and Rayleigh test, Weibull distribution, properties of nearest neighbor graphs, size distribution of connected components, geometric features, hexagonal lattices, coverage problems, simulations, model-free inference.
• Cool stuff: Random functions, random graphs, random permutations, chaotic convergence, perturbed Riemann Hypothesis (experimental number theory), attractor distributions in extreme value theory, central limit theorem for stochastic processes, numerical stability, optimum color palettes, cluster processes on the sphere.
• Resources: 28 exercises with solutions expanding the theory and methods presented in the textbook, well documented source code and formulas to generate various deviates and simulations, simple recipes (with source code) to design your own data animations as MP4 videos - see ours on YouTube, here.

Volume 1

This first volume deals with point processes in one and two dimensions, including spatial processes and clustering. The next volume in this series will cover other types of stochastic processes, such as Brownian-related and random, chaotic dynamical systems. The point process which is at the core of this textbook is called the Poisson-binomial process (not to be confused with a binomial nor a Poisson process) for reasons that will soon become apparent to the reader. Two extreme cases are the standard Poisson process, and fixed (non-random) points on a lattice. Everything in between is the most exciting part.
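To make the idea of a perturbed lattice concrete, here is a minimal one-dimensional sketch. It assumes a lattice with unit spacing and Gaussian perturbations; the noise distribution actually used in the textbook is not stated in this summary, and the function name and parameters are mine.

```python
import random

def perturbed_lattice(n_points, scale, seed=0):
    """Simulate a 1D perturbed lattice: lattice point k is moved by random noise.

    `scale` controls the amount of noise. Gaussian noise is an assumption
    made for this illustration only.
    """
    rng = random.Random(seed)
    return [k + scale * rng.gauss(0.0, 1.0) for k in range(n_points)]

fixed = perturbed_lattice(10, scale=0.0)   # no noise: points stay on the lattice
mild = perturbed_lattice(10, scale=0.1)    # points wiggle around the lattice
wild = perturbed_lattice(10, scale=5.0)    # lattice structure is mostly lost

print(fixed)
print(sorted(mild))
print(sorted(wild))
```

With scale set to zero, you get the fixed (non-random) lattice extreme. As the scale grows, the points look more and more disordered, which is the direction of the other extreme mentioned above.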

Target Audience

College-educated professionals with an analytical background (physics, economics, finance, machine learning, statistics, computer science, quant, mathematics, operations research, engineering, business intelligence), students enrolled in a quantitative curriculum, decision makers or managers working with data scientists, graduate students, researchers and college professors, will benefit the most from this textbook. The textbook is also intended for professionals interested in automated machine learning and artificial intelligence.

It includes many original exercises requiring out-of-the-box thinking, offered with solutions. Both students and college professors will find them very valuable. Most of these exercises are an extension of the core material. Also, a large number of internal and external references are immediately accessible with one click, throughout the textbook: they are highlighted respectively in red and blue in the text. The material is organized to facilitate the reading in random order as much as possible and to make navigation easy. It is written for busy readers.

The textbook includes full source code, in particular for simulations, image processing, and video generation. You don't need to be a programmer to understand the code. It is well documented and easy to read, even for people with little or no programming experience. Emphasis is on good coding practices. The goal is to help you quickly develop and implement your own machine learning applications from scratch, or use the ones offered in the textbook.

The material also features professional-looking spreadsheets allowing you to perform interactive statistical tests and simulations in Excel alone, without statistical tables or any coding. The code, data sets, videos and spreadsheets are available on my GitHub repository, here.

The content in this textbook is frequently of graduate or post-graduate level and thus of interest to researchers. Yet the unusual style of the presentation makes it accessible to a large audience, including students and professionals with a modest analytic background (a standard course in statistics). It is my hope that it will entice beginners and practitioners faced with data challenges, to explore and discover the beautiful and useful aspects of the theory, traditionally inaccessible to them due to jargon.

Vincent Granville, PhD is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), former VC-funded executive, author and patent owner. Vincent's past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, InfoSpace and other Internet startup companies (one acquired by Google). Vincent is also a former postdoc at Cambridge University and the National Institute of Statistical Sciences (NISS). He is currently publisher at DataShaping.com, and working on stochastic processes, dynamical systems, experimental math and probabilistic number theory.

Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence, among others. He is also the author of multiple books, including Statistics: New Foundations, Toolbox, and Machine Learning Recipes, and Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems.

How to Obtain the Book?

The book is available on our e-store, here. View the table of contents, bibliography, index, list of figures and exercises here on my GitHub repository. To view the full list of books, visit MachineLearningRecipes.com.

Hidden decision trees revisited (2013)

This is a revised version of an earlier article posted on AnalyticBridge.

Hidden decision trees (HDT) is a technique patented by Dr. Granville, to score large volumes of transaction data. It blends robust logistic regression with hundreds of small decision trees (each one representing for instance a specific type of fraudulent transaction) and offers significant advantages over both logistic regression and decision trees: robustness, ease of interpretation, no tree pruning, and no node splitting criteria. This makes the methodology powerful and easy to implement even for someone with no statistical background.

Hidden Decision Trees is a statistical and data mining methodology (just like logistic regression, SVM, neural networks or decision trees) to handle problems with large amounts of data, non-linearity and strongly correlated independent variables.

The technique is easy to implement in any programming language. It is more robust than decision trees or logistic regression, and helps detect natural final nodes. Implementations typically rely heavily on large, granular hash tables.

No decision tree is actually built (thus the name hidden decision trees), but the final output of a hidden decision tree procedure consists of a few hundred nodes from multiple non-overlapping small decision trees. Each of these parent (invisible) decision trees corresponds e.g. to a particular type of fraud, in fraud detection models. Interpretation is straightforward, in contrast with traditional decision trees.

The methodology was first invented in the context of credit card fraud detection, back in 2003. It is not implemented in any statistical package at this time. Frequently, hidden decision trees are combined with logistic regression in a hybrid scoring algorithm, where 80% of the transactions are scored via hidden decision trees, while the remaining 20% are scored using a compatible logistic regression type of scoring.

Hidden decision trees take advantage of the structure of large multivariate features typically observed when scoring a large number of transactions, e.g. for fraud detection. The technique is not connected with hidden Markov fields.

Potential Applications

• Fraud detection, spam detection
• Web analytics
• Keyword scoring/bidding (ad networks, paid search)
• Transaction scoring (click, impression, conversion, action)
• Click fraud detection
• Collective filtering (social network analytics)
• Relevancy algorithms
• Text mining
• Scoring and ranking algorithms
• Infringement detection
• User feedback: automated clustering

Implementation

The model presented here is used in the context of click scoring. The purpose is to create predictive scores, where score = f(response), that is, score is a function of the response. The response is sometimes referred to as the dependent variable in statistical and predictive models.

• Examples of response:
    • Odds of converting (Internet traffic data – hard/soft conversions)
    • CR (conversion rate)
    • Probability that transaction is fraudulent
• Independent variables: called features or rules. They are highly correlated.

Traditional models to be compared with hidden decision trees include logistic regression, decision trees, naïve Bayes.

Hidden decision trees (HDT) use a one-to-one mapping between scores and multivariate features. A multivariate feature is a rule combination attached to a particular transaction (that is, a vector specifying which rules are triggered and which ones are not), and is sometimes referred to as a flag vector or node.

HDT fundamentals, based on typical data set:

• If we use 40 binary rules, we have 2^40 potential multivariate features
• If the training set has 10MM transactions, we will obviously observe 10MM multivariate features at most, a number much smaller than 2^40
• 500 out of 10MM features account for 80% of all transactions
• The top 500 multivariate features have strong predictive power
• An alternate algorithm is required to classify the remaining 20% of transactions
• Using neighboring top multivariate features to score the remaining 20% of transactions creates bias, as rare multivariate features (sometimes not found in the training set) correspond to observations that are worse than average, with a low score (because they trigger many fraud detection rules).
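The counting argument above can be illustrated with a short sketch. It tallies flag vectors in a toy data set and measures how much of the traffic is covered by the most frequent ones. The data, the function name and the choice of three rules are made up for illustration; real data sets would have far more rules and transactions.

```python
from collections import Counter

def feature_coverage(flag_vectors, top_k):
    """Return the top_k most frequent multivariate features (flag vectors)
    and the share of transactions that they cover."""
    counts = Counter(flag_vectors)
    top = counts.most_common(top_k)
    covered = sum(n for _, n in top)
    return [feature for feature, _ in top], covered / len(flag_vectors)

# Number of potential features with 40 binary rules, versus 10MM transactions.
potential_features = 2 ** 40
max_observed_features = 10_000_000
print(potential_features, max_observed_features)

# Toy data set: 10 transactions, 3 binary rules (1 = rule triggered).
transactions = [
    (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0),
    (1, 0, 0), (1, 0, 0),
    (0, 1, 1), (0, 1, 1),
    (1, 1, 0),
    (1, 1, 1),
]
top_features, share = feature_coverage(transactions, top_k=3)
print(top_features, share)
```

In this toy example, three flag vectors out of five observed cover 80% of the transactions. The two rare ones, which trigger the most rules, are the ones left to the alternate algorithm.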

Implementation details

Each top node (or multivariate feature) is a final node from a hidden decision tree. There is no need for tree pruning / splitting algorithms and criteria: HDT is straightforward, fast, and can rely on efficient hash tables (where key=feature, value=score). The top 500 nodes, used to classify (that is, score) 80% of transactions, come from multiple hidden decision trees - hidden because you never used a decision tree algorithm to produce them.

The remaining 20% of transactions are scored using an alternate methodology (typically, logistic regression). Thus HDT is a hybrid algorithm, blending multiple, small, easy-to-interpret, invisible decision trees (final nodes only) with logistic regression.
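Here is a minimal sketch of the hybrid lookup described above, using a Python dictionary as the hash table (key = feature, value = score). The node scores and the fallback function are placeholders that I made up for illustration: the fallback stands in for the logistic regression and is not an actual regression.

```python
def hdt_score(flag_vector, node_scores, fallback):
    """Score a transaction: hash table lookup for top nodes, fallback otherwise.

    Returns the score and the method used to produce it.
    """
    if flag_vector in node_scores:
        return node_scores[flag_vector], "node"
    return fallback(flag_vector), "fallback"

# Hypothetical scores attached to the top nodes (top multivariate features).
node_scores = {
    (0, 0, 0): 650,   # neutral transaction, no rule triggered
    (1, 0, 0): 540,
    (0, 1, 1): 430,
}

def placeholder_fallback(flag_vector):
    """Stand-in for the regression-based score: each triggered rule costs 100 points."""
    return 650 - 100 * sum(flag_vector)

print(hdt_score((1, 0, 0), node_scores, placeholder_fallback))
print(hdt_score((1, 1, 1), node_scores, placeholder_fallback))
```

The lookup takes constant time per transaction regardless of the number of nodes, which is why the method scales well to large data sets.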

Note that in the logistic regression, we use constrained regression coefficients. These coefficients depend on 2 or 3 top parameters and have the same sign as the correlation between the rule they represent and the response or score. This makes the regression insensitive to high cross-correlations among the “independent” variables (rules), which are indeed not independent in this case. This approach is similar to ridge regression, logic regression or Lasso regression. The regression is used to fine-tune the top parameters associated with regression coefficients. I will show later in this book that approximate solutions (we are doing approximate logistic regression here) are - if well designed - almost as accurate as exact solutions, but can be far more robust.

Score blending

We are dealing with two types of scores:

• The top 500 nodes provide a score S1 available for 80% of the transactions
• The logistic regression provides a score S2 available for 100% of the transactions

To blend the scores,

• Rescale S2 using the 80% of transactions that have two scores S1 and S2. Rescaling means applying a linear transformation so that both scores have the same mean and the same variance. Let S3 be the rescaled S2.
• Transactions that can’t be scored with S1 are scored with S3
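The two blending steps above can be sketched as follows. The input format (a list with None where S1 is missing) and the function name are my own choices for this illustration, and the numbers are made up.

```python
from statistics import mean, pstdev

def blend_scores(s1, s2):
    """Blend two scores: use S1 when available, otherwise the rescaled S2 (S3).

    s1: list of scores, with None for transactions that S1 cannot score.
    s2: list of scores available for all transactions.
    """
    # Transactions that have both scores are used to calibrate the rescaling.
    both = [(a, b) for a, b in zip(s1, s2) if a is not None]
    s1_vals = [a for a, _ in both]
    s2_vals = [b for _, b in both]
    m1, sd1 = mean(s1_vals), pstdev(s1_vals)
    m2, sd2 = mean(s2_vals), pstdev(s2_vals)
    if sd2 == 0:
        raise ValueError("S2 is constant on the calibration set; cannot rescale")
    # Linear transformation giving S3 the same mean and variance as S1.
    s3 = [m1 + (b - m2) * sd1 / sd2 for b in s2]
    return [a if a is not None else c for a, c in zip(s1, s3)]

s1 = [600, 700, 800, None]      # the last transaction is not covered by a top node
s2 = [0.2, 0.4, 0.6, 0.8]       # regression-type score, on a different scale
blended = blend_scores(s1, s2)
print(blended)
```

In this toy example S2 is perfectly linear in S1 on the calibration set, so the last transaction gets a score close to 900, in line with the trend of the first three.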

HDT nodes provide an alternate segmentation of the data. One large, medium-score segment corresponds to neutral transactions (triggering no rule). Segments with very low scores correspond to specific fraud cases. Within each segment, all transactions have the same score. HDT’s provide a different type of segmentation than PCA (principal component analysis) and other analyses.

HDT History

• 2003: First version applied to credit card fraud detection
• 2006: Application to click scoring and click fraud detection
• 2008: More advanced versions to handle granular and very large data sets
    • Hidden Forests: multiple HDT’s, each one applied to a cluster of correlated rules
    • Hierarchical HDT’s: the top structure, not just rule clusters, is modeled using HDT’s
    • Non binary rules (naïve Bayes blended with HDT)

Example: Scoring Internet Traffic

The figure below shows the score distribution with a system based on 20 rules, each one having a low triggering frequency. It has the following features:

• Reverse bell curve
    • Scores below 425 correspond to un-billable transactions
    • Spike at the very bottom and very top of the score scale
    • 50% of all transactions have good scores
• Scorecard parameters
    • A drop of 50 points represents a 50% drop in conversion rate
    • Average score is 650
• Model improvement: from reverse bell curve to bell curve
    • Transaction quality vs. fraud detection
    • Add anti-rules, perform score smoothing (will also remove score caking)

Figure 5.1: Example of score distribution based on HDT’s
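The scorecard parameters listed above (a drop of 50 points means a 50% drop in conversion rate, average score of 650) can be turned into a small formula. The sketch below assumes that the halving rule applies multiplicatively across the whole score scale, and that the average score is used as the reference point. Both are assumptions on my part, and the function name is illustrative.

```python
def relative_conversion_rate(score, reference_score=650.0, points_per_halving=50.0):
    """Conversion rate relative to that of a transaction with the reference score.

    Each drop of `points_per_halving` points divides the conversion rate by 2.
    """
    return 0.5 ** ((reference_score - score) / points_per_halving)

for score in (650, 600, 550, 425):
    print(score, relative_conversion_rate(score))
```

Under these assumptions, a transaction scored 550 converts at a quarter of the rate of an average transaction.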

The figure below compares scores with conversion rates (CR). HDT’s were applied to Internet data, scoring clicks with a score used to predict chances of conversion (a conversion being a purchase, a click-out, or a sign-up on some landing page). Overall, we have a rather good fit.

Peaks in the figure below could mean:

• Bogus conversions (happens a lot if conversion is simply a click out)
• Residual noise
• Model needs improvement (incorporate anti-rules)

Valleys could mean:

• Undetected conversions (cookie issue, time-to-conversion unusually high)
• Residual noise
• Model needs improvement

Peaks and valleys can also be caused by blending together multiple types of conversions, for example traffic with 0.5% CTR together with traffic with 10% CTR.

Figure 5.2: HDT scores to predict conversions

Conclusions

HDT is a fast algorithm that is easy to implement and handles large data sets efficiently. Its output is easy to interpret.

It is non-parametric and robust. The risk of over-fitting is small if no more than the top 500 nodes are selected and ad hoc cross-validation techniques are used to remove unstable nodes. It offers a simple, built-in mechanism to compute confidence intervals for scores. See also next section.

HDT is a hybrid algorithm that detects multiple types of structures: linear structures via the regression, and non-linear structures via the top nodes.

Future directions

• Hidden forests to handle granular data
• Hierarchical HDT’s
