Tuesday, March 15, 2022

Fast Combinatorial Feature Selection with New Definition of Predictive Power (2013)

In this article, I propose a simple metric to measure predictive power. It is used for combinatorial feature selection, where a large number of feature combinations need to be ranked automatically and very fast, for instance in the context of transaction scoring, in order to optimize predictive models. This is about rather big data, and we would like to see a Hadoop methodology for the technology proposed here; it can easily be implemented in a MapReduce framework. It was developed by the author in the context of credit card fraud detection and click/keyword scoring. This material will be part of our data science apprenticeship and is included in our Wiley book.

Feature selection is a methodology used to detect the best subset of features, out of dozens or hundreds of features (also called variables or rules). By “best”, we mean with the highest predictive power, a concept defined in the following subsection. In short, we want to remove duplicate features, simplify the correlation structure (among features) a bit, and remove features that bring no value, such as features taking on random values (thus lacking predictive power) or features (rules) that are almost never triggered (unless they are perfect fraud indicators when triggered).

The problem is combinatorial in nature. You want a manageable, small set of features (say 20 features) selected from (say) a set of 500 features, to run our hidden decision trees (or some other classification / scoring technique) in a way that is statistically robust. But there are about 2.7 * 10^35 combinations of 20 features out of 500, and you would need to compute all of them to find the one with maximum predictive power. This problem is computationally intractable, and you need to find an alternate solution. The good news is that you don’t need to find the absolute maximum; you just need to find a subset of 20 features that is good enough.
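As a quick sanity check of that count (assuming Python 3.8+ for math.comb):

    import math
    print(f"{math.comb(500, 20):e}")   # about 2.7e+35 ways to choose 20 features out of 500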

One way to proceed is to compute the predictive power of each feature. Then add one feature at a time to the subset (starting with zero features) until you reach either:

  • 20 features (your limit)
  • Adding a new feature does not significantly improve the overall predictive power of the subset (in short, convergence has been attained)

At each iteration, choose the feature to be added among the two remaining features with the highest individual predictive power: among these two features, pick the one that most increases the overall predictive power of the subset under construction. You have now reduced your computations from 2.7 * 10^35 to 40 = 2 * 20 subset evaluations.
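Here is a minimal sketch of this greedy procedure in Python. It assumes a function predictive_power(subset) implementing the metric H defined later in this article; the function names and the convergence threshold min_gain are illustrative, not part of the original method description.

    def greedy_forward_selection(all_features, predictive_power, max_features=20, min_gain=1e-4):
        # Rank the candidate features once by their individual predictive power.
        ranked = sorted(all_features, key=lambda f: predictive_power([f]), reverse=True)
        subset, best_power = [], 0.0
        while ranked and len(subset) < max_features:
            # Try only the two strongest remaining features at each step, which
            # keeps the number of subset evaluations at 2 * 20 = 40.
            power, winner = max((predictive_power(subset + [f]), f) for f in ranked[:2])
            if power - best_power < min_gain:    # convergence: no significant gain
                break
            subset.append(winner)
            ranked.remove(winner)
            best_power = power
        return subset, best_power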

Technical note: an additional step to boost predictive power. Remove one feature at a time from the subset, and replace it with a feature randomly selected from the remaining features (outside the subset). If this new feature boosts the overall predictive power of the subset, keep it; otherwise revert to the old subset. Repeat this step 10,000 times or until no more gain is achieved (whichever comes first).
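A minimal sketch of this refinement step, again assuming the same predictive_power(subset) function as in the previous sketch:

    import random

    def random_swap_boost(subset, other_features, predictive_power, max_iter=10000):
        subset, pool = list(subset), list(other_features)
        best_power = predictive_power(subset)
        for _ in range(max_iter):
            i = random.randrange(len(subset))    # position of the feature to remove
            j = random.randrange(len(pool))      # random replacement from outside the subset
            candidate = subset[:i] + [pool[j]] + subset[i + 1:]
            power = predictive_power(candidate)
            if power > best_power:               # keep the swap only if it boosts H
                subset[i], pool[j] = pool[j], subset[i]
                best_power = power
        return subset, best_power

As written, the loop always runs max_iter times; to match the “until no more gain is achieved” criterion, you could also break out after a long streak of unsuccessful swaps.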

Finally, you can add two or three features at a time, rather than one. Sometimes, combined features have far better predictive power than isolated features. For instance, if feature A = country, with values in {USA, UK}, and feature B = hour of the day, with values in {“day - Pacific Time”, “night - Pacific Time”}, each feature separately has little if any predictive power. But when you combine them, you get a much more powerful feature: UK/night is good, USA/night is bad, UK/day is bad, and USA/day is good. Using this blended feature also reduces the risk of false positives / false negatives.

Also, in order to avoid highly granular features, use lists. So instead of having feature A = country (with 200 potential values) and feature B = IP address (with billions of potential values), use:

  • Feature A = country group, with 3 lists of countries (high risk, low risk, neutral). These groups can change over time.
  • Feature B = type of IP address, with 6-7 types, one being for instance “IP address is in some whitelist” (see the section on IP topology mapping for details). This grouping, together with the blended feature from the previous paragraph, is illustrated in the short sketch below.
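Here is a small sketch of these two ideas (value grouping, plus the blended country / time-of-day feature); the country lists and the day/night boundaries below are placeholders for illustration, not real risk lists.

    HIGH_RISK_COUNTRIES = {"CountryX", "CountryY"}   # placeholder list, refreshed over time
    LOW_RISK_COUNTRIES = {"USA", "UK"}               # placeholder list

    def country_group(country):
        # Collapse ~200 country values into 3 groups, as suggested above.
        if country in HIGH_RISK_COUNTRIES:
            return "high_risk"
        if country in LOW_RISK_COUNTRIES:
            return "low_risk"
        return "neutral"

    def day_period(hour_pacific):
        # Coarse binary day/night bucket, in Pacific Time.
        return "day" if 6 <= hour_pacific < 20 else "night"

    def blended_feature(country, hour_pacific):
        # Cross feature such as "UK/night" or "USA/day" from the example above.
        return country + "/" + day_period(hour_pacific)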

Predictive Power of a Feature, Cross-Validation

Here we illustrate the concept of predictive power on a subset of two features, to be a bit more general. Let’s say we have two binary features A and B, each taking two possible values, 0 or 1. Also, in the context of fraud detection, we assume that each observation in the training set is either Good (no fraud) or Bad (fraud). The fraud status (G or B) is called the response or dependent variable in statistics. The features A and B are also called rules or independent variables.

Cross validation

First, split your training set (the data where the response - B or G - is known) into two parts: control and test. Make sure that both parts are data-rich: if the test set is big (millions of observations) but contains only one or two clients (out of 200), it is data-poor, and your statistical inference will be negatively impacted (low robustness) when dealing with data outside the training set. It is a good idea to use two different time periods for control and test. You are going to compute the predictive power (including rule selection) on the control data. When you have decided on a final, optimum subset of features, you will then compute the predictive power on the test data. If the drop in predictive power is significant in the test data (compared with control), something is wrong with your analysis: detect the problem, fix it, and start over again. You can use multiple control and test sets: this will give you an idea of how the predictive power varies from one control set to another. Too much variance is an issue that should be addressed.
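For illustration, here is a minimal sketch of such a time-based split with a crude data-richness check; each observation is assumed to be a dict with date and client_id fields (these field names and the client threshold are assumptions, not from the article).

    def time_split(observations, cutoff_date):
        # Control period before the cutoff, test period on or after it.
        control = [obs for obs in observations if obs["date"] < cutoff_date]
        test = [obs for obs in observations if obs["date"] >= cutoff_date]
        # Crude data-richness check: warn if very few distinct clients appear in the test set.
        if len({obs["client_id"] for obs in test}) < 20:
            print("Warning: test set covers very few clients; inference may lack robustness")
        return control, test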

Predictive power

Using our above example with two binary features A and B taking on the two values 0 and 1, we can break the observations from the control data set into 8 categories:

  1. A=0, B=0, response = G
  2. A=0, B=1, response = G
  3. A=1, B=0, response = G
  4. A=1, B=1, response = G
  5. A=0, B=0, response = B
  6. A=0, B=1, response = B
  7. A=1, B=0, response = B
  8. A=1, B=1, response = B


Let us denote by n1, n2, …, n8 the number of observations in each of these 8 categories, and let us introduce the following quantities:

P00 = n5 / (n1 + n5), P01 = n6 / (n2 + n6), P10 = n7 / (n3 + n7), P11 = n8 / (n4 + n8)

p = (n5 + n6 + n7 + n8) / (n1 + n2 + … + n8)
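In Python, these quantities can be computed directly from the counts, passed as a list [n1, n2, …, n8] in the same order as the 8 categories above (empty bins are set to 0, following the recommendation made later in this section):

    def bin_rates(n):
        n1, n2, n3, n4, n5, n6, n7, n8 = n
        P00 = n5 / (n1 + n5) if n1 + n5 else 0.0   # fraud proportion in bin A=0, B=0
        P01 = n6 / (n2 + n6) if n2 + n6 else 0.0   # fraud proportion in bin A=0, B=1
        P10 = n7 / (n3 + n7) if n3 + n7 else 0.0   # fraud proportion in bin A=1, B=0
        P11 = n8 / (n4 + n8) if n4 + n8 else 0.0   # fraud proportion in bin A=1, B=1
        p = (n5 + n6 + n7 + n8) / sum(n)           # overall proportion of fraud
        return (P00, P01, P10, P11), p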

Let’s assume that p, the overall proportion of fraud, is less than 50% (that is, p < 0.5; otherwise we can swap fraud and non-fraud). For any r between 0 and 1, define the W function (shaped like a W), based on a parameter a (0 < a < 1; my recommendation is a = 0.5 - p), as follows (a direct implementation is sketched right after the list):

  • W(r) = 1 - (r/p), if 0 < r < p
  • W(r) = a * (r-p) / (0.5-p), if p < r < 0.5
  • W(r) = a * (r-1+p) / (p-0.5), if 0.5 < r < 1-p
  • W(r) = (r-1+p) / p, if 1-p < r < 1
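A direct implementation of this W function (it assumes 0 < p < 0.5, as above, and defaults to the recommended a = 0.5 - p):

    def W(r, p, a=None):
        if a is None:
            a = 0.5 - p                           # recommended value of the parameter a
        if r <= p:                                # 0 <= r <= p: drops from 1 to 0
            return 1.0 - r / p
        if r <= 0.5:                              # p < r <= 0.5: rises from 0 to a
            return a * (r - p) / (0.5 - p)
        if r <= 1.0 - p:                          # 0.5 < r <= 1-p: drops from a back to 0
            return a * (r - 1.0 + p) / (p - 0.5)
        return (r - 1.0 + p) / p                  # 1-p < r <= 1: rises from 0 to 1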


Typically, r = P00, P01, P10 or P11. The W function has the following properties:

  • It is minimum and equal to 0 when r = p or r = 1-p, that is, when r does not provide any information about fraud / non-fraud.
  • It is maximum and equal to 1 when r = 0 or r = 1, that is, when we have perfect discrimination between fraud and non-fraud in a given bin.
  • It is symmetric: W(r) = W(1-r) for 0 < r < 1. So if you swap Good and Bad (G and B), it still provides the same predictive power.

Now let’s define the predictive power:

H = P00 W(P00) + P01 W(P01) + P10 W(P10) + P11 W(P11)

The function H is the predictive power for the feature subset {A, B}, which has four bins 00, 01, 10, and 11, corresponding to (A=0, B=0), (A=0, B=1), (A=1, B=0), and (A=1, B=1). Although H appears to be remotely related to entropy, our H was designed to satisfy nice properties and to be parameter-driven, thanks to the parameter a. Unlike entropy, our H is not based on physical concepts or models; it is a synthetic (though useful) metric.

Note that the weights P00, …, P11 in H guarantee that bins with low frequency (that is, low triggering rate) have low impact on H. Indeed, I recommend setting W(r) to 0 for any bin that has fewer than 20 observations. For instance, the triggering rate for bin 00 is (n1 + n5) / (n1 + … + n8), the observation count is n1 + n5, and r = P00 = n5 / (n1 + n5) for this bin. If n1 + n5 = 0, set P00 to 0 and W(P00) to 0. I actually recommend doing this not just when n1 + n5 = 0, but whenever n1 + n5 < 20, especially if p is low (if p is very low, say p < 0.01, you need to over-sample bad transactions when building your training set, and weight the counts accordingly). Of course, the same rule applies to P01, P10, and P11. Note that you should avoid feature subsets that spread a large proportion of observations across a large number of almost empty bins, as well as feature subsets that produce a large number of empty bins: observations outside the training set are likely to fall into an empty or almost empty bin, which leads to high-variance predictions. To avoid this drawback, stick to binary features, select at most 20 features, and use our (hybrid) hidden decision tree methodology for scoring transactions. Finally, Pij is the naive estimator of the fraud probability P(fraud | A=i, B=j) for i, j = 0, 1.

The predictive power H has interesting properties:

  • It is always between 0 and 1, equal to 0 if the feature subset has no predictive power, and equal to 1 if the feature subset has maximum predictive power.
  • A generic version of H (not depending on p) can be created by setting p = 0.5. The W functions are then no longer shaped like a W, but like a V (the two middle pieces vanish).
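Putting these pieces together, here is a minimal sketch of H for a two-feature subset, reusing bin_rates() and W() from the sketches above; the 20-observation cutoff follows the recommendation made earlier and is exposed as a parameter.

    def predictive_power_2features(n, min_obs=20):
        n1, n2, n3, n4, n5, n6, n7, n8 = n
        (P00, P01, P10, P11), p = bin_rates(n)
        if not (0.0 < p < 0.5):
            raise ValueError("H assumes 0 < p < 0.5; swap fraud and non-fraud otherwise")
        bins = [(P00, n1 + n5), (P01, n2 + n6), (P10, n3 + n7), (P11, n4 + n8)]
        H = 0.0
        for r, count in bins:
            if count < min_obs:                   # almost empty bin: no contribution to H
                continue
            H += r * W(r, p)                      # H = P00 W(P00) + ... + P11 W(P11)
        return H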

Data structure, computations

You can pre-compute all the bin counts ni for the top 20 features (that is, the features with the highest predictive power) and store them in a small hash table with at most 2 * 2^20 entries (approximately 2 million; the factor two is because you need two measurements per bin: the number of B’s and the number of G’s). An entry in this hash table would look like

$Hash{01101001010110100100_G} = 56,

meaning that Bin # 01101001010110100100 has 56 good (G) observations.

The hash table is produced by parsing your training set once, sequentially: for each observation, compute the flag vector (which rules are triggered - that is, the 01101001010110100100 vector in this example), check whether the observation is good or bad, and update (increase the count by 1) the associated hash table entry accordingly, with the following instruction:

$Hash{01101001010110100100_G}++
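In Python, a minimal sketch of this single pass could look as follows; observations is assumed to yield (flag vector, status) pairs such as ("01101001010110100100", "G"), mirroring the hash entry above.

    from collections import defaultdict

    def build_bin_counts(observations):
        # One sequential pass over the training set: each observation increments
        # the count of its (flag vector, G/B status) entry, like $Hash{..._G}++.
        counts = defaultdict(int)
        for flags, status in observations:
            counts[flags + "_" + status] += 1
        return counts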

Then, whenever you need to measure the predictive power of a subset of these 20 features, you don’t need to parse your big data set again (potentially billions of observations); instead, just access this small hash table: it contains all you need to build your flag vectors and compute scores for any combination of features that is a subset of the top 20.

You can even do better than the top 20, maybe the top 30. While this would create a hash table with up to 2 billion entries, most of these entries would correspond to empty bins and thus would not be in the hash table. Your hash table might contain only 200,000,000 entries, which may still be too big to fit in memory and may require a MapReduce / Hadoop implementation.

Even better: build this hash table for the top 40 features. It then fully solves the feature selection problem described earlier. However, your hash table could now have up to 2 trillion entries. But if your data set only has 100 billion observations, then of course your hash table cannot have more than 100 billion entries. In this case, I suggest that you create a training set with 20 million observations, so that your hash table will have at most 20 million entries (probably fewer than 10 million, given all the empty bins) and thus can fit in memory.

You can compute the predictive power of a large number (say 100) of feature subsets by parsing the big 40-feature input hash table obtained in the previous step: for each flag vector and G/B entry in the input hash table, loop over the 100 target feature subsets to update the counts (the ni’s) for these 100 feature subsets. These counts are stored / updated in an output hash table, whose key has two components: the feature subset ID and the (projected) flag vector. You then loop over the output hash table to compute the predictive power of each feature subset. This step can be further optimized.
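Here is a minimal sketch of that pass, under the same assumptions as above; feature_subsets is a hypothetical dictionary mapping each subset ID to the positions of its features within the 40-bit flag vector.

    from collections import defaultdict

    def project_counts(bin_counts, feature_subsets):
        # One scan of the big 40-feature hash table updates the bin counts of all
        # candidate subsets at once. Output key: (subset ID, projected flag vector,
        # G/B status); these counts feed directly into the H computation above.
        out = defaultdict(int)
        for key, count in bin_counts.items():
            flags, status = key.rsplit("_", 1)
            for subset_id, positions in feature_subsets.items():
                projected = "".join(flags[i] for i in positions)
                out[(subset_id, projected, status)] += count
        return out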

