Wednesday, March 16, 2022

Advanced Machine Learning with Basic Excel - Part 1

Learn advanced machine learning techniques using Excel. No coding required.

It is amazing what you can do with a simple tool such as Excel. In this series, I share some of my spreadsheets. They cover many topics, including multiple types of regression, model-free confidence intervals, resampling, an original technique known as hidden decision trees, scatter plots with multiple groups, advanced visualization techniques, and more.

No plug-in is required. I don't use macros, pivot tables or any advanced Excel feature. In part 1 (this article), I cover the following techniques:

Read the full article here, on MachineLearningRecipes.com.

Tuesday, March 15, 2022

Machine Learning Textbook: Stochastic Processes and Simulations (Volume 1)

The book is available on our e-store, here. View the table of contents, bibliography, index, list of figures and exercises here on my GitHub repository. To view the full list of books, visit MachineLearningRecipes.com. To receive updates about our future books sign up here.

Introduction

This scratch course on stochastic processes covers significantly more material than usually found in traditional books or classes. The approach is original:  I introduce a new yet intuitive type of random structure called perturbed lattice or Poisson-binomial process, as the gateway to all the stochastic processes. Such models have started to gain considerable momentum recently, especially in sensor data, cellular networks, chemistry, physics and engineering applications. I present state-of-the-art material in simple words, in a compact style, including new research developments and open problems. I focus on the methodology and principles, providing the reader with solid foundations and numerous resources: theory, applications, illustrations, statistical inference, references, glossary, educational spreadsheet, source code, stochastic simulations, original exercises, videos and more.

Highlights

Below is a short selection highlighting some of the topics featured in the textbook. Some are research results published here for the first time. 

GPU clustering: Fractal supervised clustering in GPU (graphics processing unit) using image filtering techniques akin to neural networks, automated black-box detection of the number of clusters, unsupervised clustering in GPU using density (gray levels) equalizer.

Inference: New test of independence, spatial processes, model fitting, dual confidence regions, minimum contrast estimation, oscillating estimators, mixture and surperimposed models, radial cluster processes, exponential-binomial distribution with infinitely many parameters, generalized logistic distribution.

Nearest neighbors: Statistical distribution of distances and Rayleigh test, Weibull distribution, properties of nearest neighbor graphs, size distribution of connected components, geometric features, hexagonal lattices, coverage problems, simulations, model-free inference.

Cool stuff: Random functions, random graphs, random permutations, chaotic convergence, perturbed Riemann Hypothesis (experimental number theory), attractor distributions in extreme value theory, central limit theorem for stochastic processes, numerical stability, optimum color palettes, cluster processes on the sphere.

Resources: 28 exercises with solution expanding the theory and methods presented in the textbook, well documented source code and formulas to generate various deviates and simulations, simple recipes (with source code) to design your own data animations as MP4 videos - see ours on YouTube, here.

Volume 1

This first volume deals with point processes in one and two dimensions, including spatial processes and clustering. The next volume in this series will cover other types of stochastic processes, such as Brownian-related and random, chaotic dynamical systems. The point process which is at the core of this textbook is called the Poisson-binomial process (not to be confused with a binomial nor a Poisson process) for reasons that will soon become apparent to the reader. Two extreme cases are the standard Poisson process, and fixed (non-random) points on a lattice. Everything in between is the most exciting part.

Target Audience

College-educated professionals with an analytical background (physics, economics, finance, machine learning, statistics, computer science, quant, mathematics, operations research, engineering, business intelligence), students enrolled in a quantitative curriculum, decision makers or managers working with data scientists, graduate students, researchers and college professors, will benefit the most from this textbook. The textbook is also intended to professionals interested in automated machine learning and artificial intelligence.

It includes many original exercises requiring out-of-the-box thinking, and offered with solution. Both students and college professors will find them very valuable. Most of these exercises are an extension of the core material. Also, a large number of internal and external references are immediately accessible with one click, throughout the textbook: they are highlighted respectively in red and blue in the text. The material is organized to facilitate the reading in random order as much as possible and to make navigation easy. It is written for busy readers.

The textbook includes full source code, in particular for simulations, image processing, and video generation. You don't need to be a programmer to understand the code. It is well documented and easy to read, even for people with little or no programming experience. Emphasis is on good coding practices. The goal is to help you quickly develop and implement your own machine learning applications from scratch, or use the ones offered in the textbook.

The material also features professional-looking spreadsheets allowing you to perform interactive statistical tests and simulations in Excel alone, without statistical tables or any coding. The code, data sets, videos and spreadsheets are available on my GitHub repository, here.

The content in this textbook is frequently of graduate or post-graduate level and thus of interest to researchers. Yet the unusual style of the presentation makes it accessible to a large audience, including students and professionals with a modest analytic background (a standard course in statistics). It is my hope that it will entice beginners and practitioners faced with data challenges, to explore and discover the beautiful and useful aspects of the theory, traditionally inaccessible to them due to jargon.

About the Author

Vincent Granville, PhD is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), former VC-funded executive, author and patent owner. Vincent's past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, InfoSpace and other Internet startup companies (one acquired by Google). Vincent is also a former post-doct from Cambridge University, and the National Institute of Statistical Sciences (NISS). He is currently publisher at DataShaping.com, and working on stochastic processes, dynamical systems, experimental math and probabilistic number theory.

Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence, among others. He is also the author of multiple books, including Statistics: New Foundations, Toolbox, and Machine Learning Recipes, and Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems.

How to Obtain the Book?

The book is available on our e-store, here. View the table of contents, bibliography, index, list of figures and exercises here on my GitHub repository. To view the full list of books, visit MachineLearningRecipes.com.

Hidden decision trees revisited (2013)

This is a revised version of an earlier article posted on AnalyticBridge.

Hidden decision trees (HDT) is a technique patented by Dr. Granville, to score large volumes of transaction data. It blends robust logistic regression with hundreds small decision trees (each one representing for instance a specific type of fraudulent transaction) and offers significant advantages over both logistic regression and decision trees: robustness, ease of interpretation, and no tree pruning, no node splitting criteria. It makes this methodology powerful and easy to implement even for someone with no statistical background.

Hidden Decision Trees is a statistical and data mining methodology (just like logistic regression, SVM, neural networks or decision trees) to handle problems with large amounts of data, non-linearity and strongly correlated independent variables.

The technique is easy to implement in any programming language. It is more robust than decision trees or logistic regression, and helps detect natural final nodes. Implementations typically rely heavily on large, granular hash tables.

No decision tree is actually built (thus the name hidden decision trees), but the final output of a hidden decision tree procedure consists of a few hundred nodes from multiple non-overlapping small decision trees. Each of these parent (invisible) decision trees corresponds e.g. to a particular type of fraud, in fraud detection models. Interpretation is straightforward, in contrast with traditional decision trees.

The methodology was first invented in the context of credit card fraud detection, back in 2003. It is not implemented in any statistical package at this time. Frequently, hidden decision trees are combined with logistic regression in an hybrid scoring algorithm, where 80% of the transactions are scored via hidden decision trees, while the remaining 20% are scored using a compatible logistic regression type of scoring.

Hidden decision trees take advantage of the structure of large multivariate features typically observed when scoring a large number of transactions, e.g. for fraud detection. The technique is not connected with hidden Markov fields.

Potential Applications

Fraud detection, spam detection
Web analytics
- Keyword scoring/bidding (ad networks, paid search)
- Transaction scoring (click, impression, conversion, action)
- Click fraud detection
- Web site scoring, ad scoring, landing page / advertiser scoring
- Collective filtering (social network analytics)
- Relevancy algorithms
Text mining
- Scoring and ranking algorithms
- Infringement detection
- User feedback: automated clustering

Implementation

The model presented here is used in the context of click scoring. The purpose is to create predictive scores, where score = f(response), that is, score is a function of the response. The response is sometimes referred to as the dependent variable in statistical and predictive models.

Examples of Response:
- Odds of converting (Internet traffic data – hard/soft conversions)
- CR (conversion rate)
- Probability that transaction is fraudulent
Independent variables: Called features or rules. They are highly correlated

Traditional models to be compared with hidden decision trees include logistic regression, decision trees, naïve Bayes.

Hidden decision trees (HDT) use a one-to-one mapping between scores and multivariate features. A multivariate feature is a rule combination attached to a particular transaction (that is, a vector specifying which rules are triggered, which ones are not), and is sometimes referred to as flag vector or node.

HDT fundamentals, based on typical data set:

If we use 40 binary rules, we have 2 at power 40 potential multivariate features
If training set has 10 MM transactions, we will obviously observe 10MM multivariate features at most, a number much smaller than 2 at power 40
500 out of 10MM features account to 80% of all transactions
The top 500 multivariate features have strong predictive power
An alternate algorithm is required to classify the 20% remaining transactions
Using neighboring top multivariate features to score the 20% remaining transactions creates bias, as rare multivariate features (sometimes not found in the training set) corresponds to observation that are worse than average, with a low score (because they trigger many fraud detection rules).

Implementation details

Each top node (or multivariate feature) is a final node from a hidden decision tree. There is no need for tree pruning / splitting algorithms and criteria: HDT is straightforward, fast, and can rely on efficient hash tables (where key=feature, value=score). The top 500 nodes, used to classify (that is, score) 80% of transactions, come from multiple hidden decision trees - hidden because you never used a decision tree algorithm to produce them.

The remaining 20% transactions scored using alternate methodology (typically, logistic regression). Thus HDT is a hybrid algorithm, blending multiple, small, easy-to-interpret, invisible decision trees (final nodes only) with logistic regression.

Note that in the logistic regression, we use constrained regression coefficients. These coefficients depend on 2 or 3 top parameters and have the same sign as the correlation between the rule they represent, and the response or score. This make the regression non-sensitive to high cross correlations among the “independent” variables (rules) which are indeed not independent in this case. This approach is similar to ridge regression, logic regression or Lasso regression. The regression is used to fine tune the top parameters associated with regression coefficients. I will later in this book show that approximate solutions (we are doing approximate logistic regression here) are - if well designed - almost as accurate as exact solutions, but can be far more robust.

Score blending

We are dealing with two types of scores:

The top 500 nodes provide a score S1 available for 80% of the transactions
The logistic regression provides a score S2 available for 100% of the transactions

To blend the scores,

Rescale S2 using the 80% transactions that have two scores S1 and S2. Rescaling means apply a linear transformation so that both scores have same mean and same variance. Let S3 be the rescaled S2.
Transactions that can’t be scored with S1 are scored with S3

HDT nodes provide an alternate segmentation of the data. One large, medium-score segment corresponds to neutral transactions (triggering no rule). Segments with very low scores correspond to specific fraud cases. Within each segment, all transactions have the same score. HDT’s provide a different type of segmentation than PCA (principal component analysis) and other analyses.

HDT History

2003: First version applied to credit card fraud detection
2006: Application to click scoring and click fraud detection
2008: More advanced versions to handle granular and very large data sets
- Hidden Forests: multiple HDT’s, each one applied to a cluster of correlated rules
- Hierarchical HDT’s: the top structure, not just rule clusters, is modeled using HDT’s
- Non binary rules (naïve Bayes blended with HDT)

Example: Scoring Internet Traffic

The figure below shows the score distribution with a system based on 20 rules, each one having a low triggering frequency. It has the following features:

Reverse bell curve
Scores below 425 correspond to un-billable transactions
Spike at the very bottom and very top of the score scale
50% of all transactions have good scores
Scorecard parameters
- A drop of 50 points represents a 50% drop in conversion rate:
- Average score is 650.
Model improvement: from reverse bell curve to bell curve
- Transaction quality vs. fraud detection
- Add anti-rules, perform score smoothing (will also remove score caking)

Figure 5.1: Example of score distribition based on HDT’s

The figure below compares scores with conversion rates (CR). HDT’s were applied to Internet data, scoring clicks with a score used to predict chances of conversion (a conversion being a purchase, a click out, sign-up on some landing page). Overall, we have a rather good fit.

Peaks in the figure below could mean:

Bogus conversions (happens a lot if conversion is simply a click out)
Residual Noise
Model needs improvement (incorporate anti-rules)

Valleys could mean:

Undetected conversions (cookie issue, time-to-conversion unusually high)
Residual noise
Model needs improvement

Peaks and valleys can also be cause if you blend together multiple types of conversions: traffic with 0.5% CTR together with traffic with 10% CTR.

Figure 5.2: HDT scores to predict conversions

Conclusions

HDT is a fast algorithm, easy to implement, can handle large data sets efficiently, and the output is easy to interpret.

It is non parametric and robust. The risk of over-fitting is small if no more than top 500 nodes are selected and ad-hoc cross validation techniques used to remove unstable nodes. It offers built-in, simple mechanism to compute confidence intervals for scores. See also next section.

HDT is hybrid algorithm to detect multiple types of structures: linear structures via the regression, and non linear structures via the top nodes.

Future directions

Hidden forests to handle granular data
Hierarchical HDT’s

Fast Combinatorial Feature Selection with New Definition of Predictive Power (2013)

In this article, I proposes a simple metric to measure predictive power. It is used for combinatorial feature selection, where a large number of feature combinations need to be ranked automatically and very fast, for instance in the context of transaction scoring, in order to optimize predictive models. This is about rather big data, and we would like to see an Hadoop methodology for the technology proposed here. It can easily be implemented in a Map Reduce framework. It was developed by the author in the context of credit card fraud detection, and click/keyword scoring. This material will be part of our data science apprenticeship, and included in our Wiley book.

Feature selection is a methodology used to detect the best subset of features, out of dozens or hundreds of features (also called variables or rules). By “best”, we mean with highest predictive power, a concept defined in the following subsection. In short, we want to remove duplicate features, simplify a bit the correlation structure (among features) and remove features that bring no value, such as a features taking on random values, thus lacking predictive power, or features (rules) that are almost never triggered (except if they are perfect fraud indicators when triggered).

The problem is combinatorial in nature. You want a manageable, small set of features (say 20 features) selected from (say) a set of 500 features, to run our hidden decision trees (or some other classification / scoring technique) in a way that is statistically robust. But there are 2.7 * 1035 combinations of 20 features out of 500, and you need to compute all of them to find the one with maximum predictive power. This problem is computationally intractable, and you need to find an alternate solution. The good thing is that you don’t need to find the absolute maximum; you just need to find a subset of 20 features that is good enough.

One way to proceed is to compute the predictive power of each feature. Then, add one feature at a time to the subset (starting with 0 feature) until you reach either

20 features (your limit)
Adding a new feature does not significantly improve the overall predictive power of the subset (in short, convergence has been attained)

At each iteration, choose the feature to be added among the two remaining features with the highest predictive power: you will choose (among these two features) the one that increases the overall predictive power (of the subset under construction) most. Now you have reduced your computations from 2.7 * 1035 to 40 = 2 * 20.

Technical note: Additional step to boost predictive power. Remove one feature at a time from the subset, and replace it with a feature randomly selected from the remaining features (from outside the subset). If this new feature boosts overall predictive power of subset, keep it, and otherwise switch back to old subset. Repeat this step 10,000 times or until no more gain is achieved (whichever comes first).

Finally, you can add two or three features at a time, rather than one. Sometimes, combined features have far better predictive power than isolated features. For instance if feature A = country, with values in {USA, UK} and feature B = hour of the day, with values in {“day - Pacific Time”, “night - Pacific Time”}, both features separately have little if any predictive power. But when you combine both of them, you have a much more powerful feature: UK/night is good, USA/night is bad, UK/day is bad, and USA/day is good. Using this blended feature also reduces the risk of false positives / false negatives.

Also, in order to avoid highly granular features, use lists. So instead of having feature A = country (with 200 potential values), and feature B = IP address (with billions of potential values), use:

Feature A = country group, with 3 list of countries (high risk, low risk, neutral). These groups can change over time.
Feature B = type of IP address (with 6-7 types, one being for instance “IP address is in some whitelist” (see section on IP topology mapping for details).

Predictive Power of a Feature, Cross-Validation

Here we illustrate the concept of predictive power on a subset of 2 features, to be a bit more general. Let’s say wie have two binary features A and B taking two possible values 0 or 1. Also, in the context of fraud detection, we assume that each observation in the training set is either Good (no fraud) or Bad (fraud). The fraud status (G or B) is called the response or dependent variable in statistics. The features A and B are also called rules or independent variables.

Cross validation

First, split your training set (the data where the response - B or G - is known) into two parts: control and test. Make sure that both parts are data-rich: if the test set is big (millions of observations) but contain only one or two clients (out of 200), it is data-poor and your statistical inference will be negatively impacted (low robustness) when dealing with data outside the training set. It is a good idea to use two different time periods for control and test. You are going to compute the predictive power (including rule selection) on the control data. When you have decided on a final, optimum subset of features, you will then compute the predictive power on the test data. If the drop in predictive power is significant in the test data (compared with control), something is wrong with your analysis: detect the problem, fix it, start over again. You can use multiple control and test sets: this will give you an idea of how the predictive power varies from one control to another control set. Too much variance is an issue that should be addressed.

Predictive power

Using our above example with two binary features A, B taking on two values 0, 1, we can break the observations from the control data set into 8 categories

A=0, B=0, response = G
A=0, B=1, response = G
A=1, B=0, response = G
A=1, B=1, response = G
A=0, B=0, response = B
A=0, B=1, response = B
A=1, B=0, response = B
A=1, B=1, response = B

Let denote as n1, n2 … n8 the number of observations in each of these 8 categories, and let introduce the following quantities:

P00 = n5 / (n1 + n5), P01 = n6 / (n2 + n6), P10 = n7 / (n3 + n7), P11 = n8 / (n4 + n8)

p = (n5 + n6 + n7 + n8) / (n1 + n2 + … + n8)

Let’s assume that p, measuring the overall proportion of fraud, is less than 50% (that is, p<0.5, otherwise we can swap between fraud and non-fraud). For any r between 0 and 1, define the W function (shaped like a W), based on a parameter a (0 < a < 1, and my recommend is a = 0.5-p) as follows:

W(r) = 1 - (r/p), if 0 < r < p
W(r) = a * (r-p) / (0.5-p), if p < r < 0.5
W(r) = a * (r-1+p) / (p-0.5), if 0.5 < r < 1-p
W(r) = (r-1+p) / p, if 1-p < r < 1

Typically, r = P00, P01, P10 or P11. The W function has the following properties:

It is minimum and equal to 0 when r = p or r = 1-p, that is, when r does not provide any information about fraud / non fraud,
It is maximum and equal to 1when r=1 or r=0, that is, when we have perfect discrimination between fraud and non-fraud, in a given bin.
It is symmetric: W(r) = W(1-r) for 0 < r < 1. So if you swap Good and Bad (G and B), it still provides the same predictive power.

Now let’s define the predictive power:

H = P00 W(P00) + P01 W(P01) + P10 W(P10) + P11 W(P11)

The function H is the predictive power for the feature subset {A, B} having four bins 00, 01, 10, and 11, corresponding to (A=0, B=0), (A=0, B=1), (A=1, B=0) and (A=1, B=1). Although H appears to be remotely related to entropy, our H was designed to satisfy nice properties, and to be parameter-driven, thanks to a. Unlike entropy, our H is not based on physical concepts or models; it is actually a synthetic (though useful) metric.

Note that the weights P00… P11 in H guarantee that bins with low frequency (that is, low triggering rate) have low impact on H. Indeed, I recommend setting W(r) to 0 for any bin that has less than 20 observations. For instance, the triggering rate for bin 00 is (n1 + n5) / (n1 + … + n8), observations count is n1 + n5, and r = P00 = n5 / (n1 + n5) for this bin. If n1 + n5 = 0, set P00 to 0 and W(P00) to 0. I actually recommend to do this not just if n1 + n5 = 0, but also whenever n1 + n5 < 20, especially if p is low (if p is very low, say p < 0.01, you need to over-sample bad transactions when building your training set, and weight the counts accordingly). Of course, the same rule applies to P01, P10, and P11. Note that you should avoid feature subsets that have a large proportion of observations spread across a large number of almost empty bins, as well as feature subsets that produce a large number of empty bins: observations outside the training set are likely to belong to an empty or almost empty bin, and it leads to high-variance predictions. To avoid this drawback, stick to binary features, select up to 20 features, and use our (hybrid) hidden decision tree methodology for scoring transactions. Finally, Pij is the naive estimator of the probability P(A=i, B=j) for i,j = 0,1.

The predictive power H has interesting properties:

It is always between 0 and 1, equal to 0 if the feature subset has no predictive power, and equal to 1 if the feature subset has maximum predictive power.
A generic version of H (not depending on p) can be created by setting p=0.5. Then the W functions are not shaped like a W anymore, they are shaped like a V.

Data structure, computations

You can pre-compute all the bin counts ni’s for the top 20 features (that is, features with highest predictive power) and store them in a small hash table with at most 2 * 220 entries (approx. 2 million; the factor two is because you need two measurements per bin: number of B’s, and number of G’s). An entry in this hash table would like

$Hash{01101001010110100100_G} = 56,

meaning that Bin # 01101001010110100100 has 56 good (G) observations.

The hash table is produced by parsing your training set one time, sequentially: for each observation, compute the flag vector (which rules are triggered, that is the 01101001010110100100 vector in this example), check if it’s good or bad, and update (increase count by 1) the associated hash table entry accordingly, with the following instruction:

$Hash{01101001010110100100_G}++

Then whenever you need to measure the predictive power of a subset of these 20 features, you don’t need to parse your big data set again (potentially billion of observations), but instead, just access this small hash table: this table contains all you need to build your flag vectors and compute scores, for any combination of features that is a subset of the top 20.

You can even do better than top 20, maybe top 30. While this would create a hash table with 2 billion entries, most of these entries would correspond to empty bins and thus would not be in the hash table. Your hash table might contain only 200,000,000 entries, maybe too big to fit in memory, and requiring a Map Reduce / Hadoop implementation.

Even better: build this hash table for the top 40 features. Then it will fully solve your feature selection problem described earlier. However now, your hash table could have up to 2 trillion entries. But if your dataset only has 100 billion observations, then of course your hash table cannot have more than 100 billion entries. In this case, I suggest that you create a training set with 20 million observations, so that your hash table will have at most 20 million entries (probably less than 10 million with all the empty bins), and thus can fit in memory.

You can compute the predictive power of a large number (say 100) of feature subsets by parsing the big 40-feature input hash table obtained in the previous step, then for each flag vector and G/B entry in the input hash table, loop over the 100 target feature subsets to update counts (the ni’s) for these 100 feature subsets: these counts are stored / updated in an output hash table. The key in the output hash table has two components: feature ID and flag vector. You then loop over the output hash table to compute the predictive power for each feature subset. This step can be further optimized.

Vincent Granville Articles and Books