We discuss a new approach to selecting features from a large set, in an unsupervised machine learning framework. In supervised learning, such as linear regression or supervised clustering, it is possible to test the predictive power of a set of features (also called independent variables or predictors by statisticians) using metrics such as goodness of fit with the response (the dependent variable), for instance the R-squared coefficient. This makes feature selection rather easy.
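As a minimal sketch of the supervised case just mentioned, each candidate feature can be scored by the R-squared of a simple linear fit against the response. The feature names and synthetic data below are purely illustrative, not taken from the article.

```python
import numpy as np

def r_squared(x: np.ndarray, y: np.ndarray) -> float:
    """R-squared of a one-variable linear regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])      # add intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
    residuals = y - X @ beta
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Rank hypothetical features by goodness of fit with the response.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
features = {"x1": y + rng.normal(scale=0.5, size=200),  # informative
            "x2": rng.normal(size=200)}                  # pure noise
scores = {name: r_squared(x, y) for name, x in features.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```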
In the unsupervised setting discussed here, this is not feasible. The context could be pure clustering, with no training set available, for instance in a fraud detection problem. We are also dealing with both discrete and continuous variables, possibly including dummy variables that represent categories such as gender. We assume that no simple statistical model explains the data, so the framework is model-free and data-driven. In this context, traditional methods rely on information theory metrics to determine which subset of features carries the largest amount of information.
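One such information theory metric is the empirical joint entropy of a feature subset. The sketch below is one possible way to estimate it for mixed discrete/continuous data; the binning strategy is an assumption on my part, not something prescribed by the article.

```python
import numpy as np

def joint_entropy(columns: list[np.ndarray], bins: int = 10) -> float:
    """Empirical joint entropy (in bits) of the given feature columns."""
    discretized = []
    for col in columns:
        if np.issubdtype(col.dtype, np.floating):
            # Discretize continuous variables; dummy/discrete ones stay as-is.
            edges = np.histogram_bin_edges(col, bins=bins)
            col = np.digitize(col, edges[1:-1])
        discretized.append(col.astype(int))
    # Count occurrences of each observed joint cell.
    keys = np.stack(discretized, axis=1)
    _, counts = np.unique(keys, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))
```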
A classic approach consists of identifying the most information-rich feature, and then growing the set of selected features by adding new ones that maximize some criterion. There are many variants of this approach, for instance adding more than one feature at a time, or removing some features during the iterative selection. Searching for an exact optimum of this combinatorial problem is not computationally feasible when the number of features is large, so an approximate solution (a local optimum) is usually acceptable, and accurate enough for business purposes.
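A hedged sketch of that greedy forward-selection scheme: start from the single best-scoring feature, then repeatedly add whichever remaining feature improves the criterion the most, stopping when no candidate helps. The scoring function is left abstract; plugging in joint entropy or mutual information (as in the previous snippet) is one possible choice, not necessarily the article's.

```python
from typing import Callable

def forward_select(feature_names: list[str],
                   score: Callable[[list[str]], float],
                   k: int) -> list[str]:
    """Greedy forward selection of up to k features maximizing `score`."""
    selected: list[str] = []
    remaining = list(feature_names)
    while remaining and len(selected) < k:
        # Evaluate each candidate added to the current subset.
        best = max(remaining, key=lambda f: score(selected + [f]))
        if selected and score(selected + [best]) <= score(selected):
            break  # no candidate improves the criterion; stop early
        selected.append(best)
        remaining.remove(best)
    return selected

# Example with a toy criterion (subset size): picks the first two features.
print(forward_select(["x1", "x2", "x3"], lambda s: float(len(s)), k=2))
```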
Content of this article:
- Review of popular methods
- New, simple idea for feature selection
- Testing on a dataset with known theoretical entropy (and conclusions)
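To illustrate the last item above, here is a hypothetical example (not the article's actual test data) of a synthetic dataset whose theoretical entropy is known in closed form, so an empirical estimate can be checked against it: d independent features, each uniform over m symbols, have a joint entropy of d * log2(m) bits.

```python
import numpy as np

rng = np.random.default_rng(42)
d, m, n = 3, 4, 100_000                      # features, symbols, samples
data = rng.integers(0, m, size=(n, d))       # independent uniform features

# Plug-in entropy estimate from the empirical joint distribution.
_, counts = np.unique(data, axis=0, return_counts=True)
p = counts / n
empirical = -np.sum(p * np.log2(p))
theoretical = d * np.log2(m)
print(f"empirical = {empirical:.3f} bits, theoretical = {theoretical:.3f} bits")
```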
Read the full article here.