Tuesday, August 15, 2017

Nice Generalization of the K-NN Clustering Algorithm -- Also Useful for Data Reduction

I describe here an interesting and intuitive clustering algorithm (one that can also be used for data reduction) offering several advantages over traditional classifiers:
  • More robust against outliers and erroneous data
  • Much faster to execute
  • A generalization of well-known algorithms
You don't need to know K-NN to understand this article -- but click here if you want to learn more about it. You don't need a background in statistical science either. 
Let's describe this new algorithm and its various components in simple English.
General Framework
We are dealing here with a supervised learning problem -- more specifically, classification. In particular, we want to assign a class label to a new observation that does not belong to the training set. Instead of checking out individual points (the nearest neighbors) and using a majority voting rule to assign the new observation to a cluster based on nearest-neighbor counts, we check out cliques of points, and focus on the nearest cliques rather than on the nearest points.
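For reference, here is the classic K-NN voting rule that this approach generalizes. This is a minimal sketch in Python; the function name, the choice of Euclidean distance, and k = 5 are illustrative assumptions, not details taken from the article.

    from collections import Counter
    import math

    def knn_classify(x, points, labels, k=5):
        # Classic K-NN: label x by majority vote among its k nearest neighbors.
        # Indices of training points, sorted by Euclidean distance to x.
        order = sorted(range(len(points)), key=lambda i: math.dist(x, points[i]))
        # Majority vote among the k nearest neighbors.
        votes = Counter(labels[i] for i in order[:k])
        return votes.most_common(1)[0][0]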
Figure: cliques computed based on the smallest-circle problem
Cliques and Clique Density
The cliques considered here are defined by circles (in two dimensions) or spheres (in three dimensions). In the most basic version, we have one clique per cluster, and the clique is defined as the smallest circle containing a pre-specified proportion p of the points from the cluster in question. If the clusters are well separated, we can even use p = 1. We define the density of a clique as the number of points it contains per unit area. In general, we want to build cliques with high density.
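To make these definitions concrete, here is a small Python sketch that builds one clique for a cluster. Solving the exact smallest-circle problem takes more machinery (see the sections listed below), so this version approximates it: it centers the circle at the cluster centroid and uses the p-quantile of distances to the centroid as the radius. That approximation, and the 2-D density formula, are my assumptions rather than the article's exact method.

    import math

    def build_clique(points, p=0.95):
        # Approximate the smallest circle containing a proportion p of the
        # points: center at the centroid, radius = p-quantile of distances.
        # The true smallest enclosing circle may be smaller than this one.
        n, dim = len(points), len(points[0])
        center = tuple(sum(pt[d] for pt in points) / n for d in range(dim))
        dists = sorted(math.dist(pt, center) for pt in points)
        radius = dists[max(0, math.ceil(p * n) - 1)]  # covers ceil(p*n) points
        covered = sum(1 for d in dists if d <= radius)
        # Clique density: points covered per unit area (2-D circle case).
        density = covered / (math.pi * radius ** 2) if radius > 0 else float("inf")
        return center, radius, density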
Ideally, we want each cluster in the training set to be covered by a small number of (possibly slightly overlapping) cliques, each one having a high density. Also, as a general rule, a training-set point can belong to only one clique and, ideally, to only one cluster. The circles associated with two cliques, however, are allowed to overlap. A nearest-clique classification rule built on these definitions is sketched below.
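Here the distance from a point to a clique is taken as its distance to the clique's circle (zero if the point falls inside it); the article's full rule, covered in the sections listed below, may also take clique density into account, so treat this as an assumption-laden sketch rather than the definitive rule.

    import math

    def classify(x, cliques):
        # cliques: list of (label, center, radius) tuples, e.g. from build_clique.
        # Assign x to the cluster whose clique circle is closest.
        best_label, best_dist = None, float("inf")
        for label, center, radius in cliques:
            d = max(0.0, math.dist(x, center) - radius)  # 0 if inside the circle
            if d < best_dist:
                best_label, best_dist = label, d
        return best_label

For example, classify((1.0, 2.0), [("A", (0, 0), 1.5), ("B", (5, 5), 1.0)]) returns "A", since that point lies much closer to cluster A's circle than to cluster B's.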
Additional sections:
  • Classification Rule, Computational Complexity, Memory Requirements
  • Cliques Building and Smallest-Circle-Problem 
  • Gravitational Field Generated by Clusters
  • Building an Efficient Clique System
  • Non-Circular Cliques
  • Source Code
  • Potential Enhancements, Data Reduction, and Conclusions
To read the full article, click here. To access data science articles from the same author, click here.