Under-Sampling Methods for Imbalanced Data (ClusterCentroids, RandomUnderSampler, NearMiss)
Class imbalance is a big problem for classification tasks. In Python, there is a library that provides many algorithms to handle imbalanced data and mitigate its harms.
imbalanced-learn is a Python package offering several re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects. For an introduction, you can read the imbalanced-learn documentation and get an idea of its outline.
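The package is distributed on PyPI, so it can be installed with pip:
pip install -U imbalanced-learn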
In this series, each technique will be handled separately. Re-sampling techniques are divided into two main categories:
● Under-sampling the majority class(es).
● Over-sampling the minority class.
There are others as well: combining over- and under-sampling, and creating ensembles of balanced sets.
This article is about under-sampling.
Under-sampling techniques come in two types: prototype generation and prototype selection.
A. Prototype Generation
Prototype generation algorithms reduce the number of samples by generating a new set from the given original data set. In other words, they remove majority-class samples and generate new ones in their place. There is one such algorithm in the imbalanced-learn library: ClusterCentroids.
ClusterCentroids
This technique under-samples by generating a new set of samples based on centroids found by clustering methods. Specifically, the algorithm generates the new set from the cluster centroids of a KMeans algorithm.
In other words, it under-samples the majority class by replacing each cluster of majority samples with the cluster centroid of a KMeans algorithm.
The newly generated set is synthesized from the centroids of the K-means method instead of the original samples. Only the majority class(es) are transformed; the minority class is preserved as-is.
We will use the travel_insurance dataset for this examination.
There is an imbalance between classes 0 and 1 in the target feature.
After applying the under-sampling technique (ClusterCentroids), the number of majority-class samples (label 0) decreased to match the number of minority-class samples (label 1). The data are balanced now.
We can compare the original data with the under-sampled data.
In the original data set, there are some instances where Claim is 0, Duration is 46, and Age is 33. After ClusterCentroids, all of them are removed and new instances are generated in their place. Net Sales and Commission, the continuous features, are synthesized based on the centroids.
For the instance where Claim is 1, Duration is 76, and Age is 36, nothing changes. The minority class preserves its original form.
from imblearn.under_sampling import ClusterCentroids
undersampler = ClusterCentroids()
X_smote, y_smote = undersampler.fit_resample(X_train, y_train)
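As a quick sanity check, here is a minimal sketch (using a synthetic dataset from scikit-learn instead of travel_insurance; the names X, y, X_res, and y_res are illustrative) that prints the class counts before and after resampling:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

# synthetic data with a roughly 9:1 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # majority class 0 heavily outnumbers minority class 1

undersampler = ClusterCentroids(random_state=0)
X_res, y_res = undersampler.fit_resample(X, y)
print(Counter(y_res))  # both classes now have the same count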
ClusterCentroids has several parameters: with sampling_strategy we can adjust the ratio between the minority and majority classes, and with the estimator parameter we can change the clustering estimator, whose default is KMeans. Check this link for the full list of parameters.
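As an illustration only (the 0.5 ratio and MiniBatchKMeans are assumptions, not values from the original example), such a configuration could look like this:

from sklearn.cluster import MiniBatchKMeans
from imblearn.under_sampling import ClusterCentroids

# sampling_strategy=0.5: after resampling, the minority/majority ratio is 0.5,
# i.e. the majority class keeps twice as many samples as the minority class
undersampler = ClusterCentroids(
    sampling_strategy=0.5,
    estimator=MiniBatchKMeans(n_init=10, random_state=0),
    random_state=0,
)
X_res, y_res = undersampler.fit_resample(X_train, y_train)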
B. Prototype Selection
Unlike prototype generation algorithms, prototype selection algorithms select samples from the original set.
Prototype selection algorithms reduce the number of majority class samples, but they don't generate new instances.
Prototype selection algorithms are divided into two groups:
● Controlled under-sampling techniques.
● Cleaning under-sampling techniques.
With controlled under-sampling techniques, the user can define the number of samples; the algorithms work under that control.
With cleaning under-sampling techniques, the user can't define the number of samples. The algorithms work on their own, cleaning the feature space and trying to reduce noise.
B.1. Controlled under-sampling techniques
There are two algorithms in the imbalanced-learn library that are controlled under-sampling techniques.
RandomUnderSampler
RandomUnderSampler randomly deletes the rows of the majority class(es) according to our sampling strategy.
RandomUnderSampler is a fast and easy way to balance the data by randomly selecting a subset of data for the targeted classes.
from imblearn.under_sampling import RandomUnderSampler
under = RandomUnderSampler(sampling_strategy=1)
X_smote, y_smote = under.fit_resample(X_train, y_train)
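sampling_strategy accepts other forms besides a float; here is a hedged sketch (the labels 0/1 and the target counts are illustrative, assuming the X_train and y_train from above):

from imblearn.under_sampling import RandomUnderSampler

# float: desired minority/majority ratio after resampling
under = RandomUnderSampler(sampling_strategy=0.5, random_state=0)

# dict: exact number of samples to keep per class
under = RandomUnderSampler(sampling_strategy={0: 200, 1: 100}, random_state=0)

X_res, y_res = under.fit_resample(X_train, y_train)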
This topic has been covered in an earlier post. For more detail, you can check this link.
NearMiss
NearMiss is also a controlled under-sampling technique. It adds some rules on top of RandomUnderSampler. NearMiss has two important parameters: version and n_neighbors.
NearMiss has 3 versions. We can define the version with the version parameter.
n_neighbors refers to the size of the neighborhood to consider when computing the average distance to the minority class samples.
NearMiss-1 selects the samples from the majority class for which the average distance to the k nearest samples of the minority class is the smallest. NearMiss-2 selects the samples from the majority class for which the average distance to the k farthest samples of the minority class is the smallest. NearMiss-3 is a 2-step algorithm: first, for each minority sample, its m nearest majority neighbors are kept; then, among those, the majority samples selected are the ones for which the average distance to the k nearest minority neighbors is the largest.
from imblearn.under_sampling import NearMiss
undersampler = NearMiss(version=1, n_neighbors=3)
X_smote, y_smote = undersampler.fit_resample(X_train, y_train)
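To see how the three versions differ in practice, a small sketch like the following (variable names are illustrative) fits each version and prints the resulting class counts; the selected majority samples differ between versions even when the counts match:

from collections import Counter
from imblearn.under_sampling import NearMiss

for version in (1, 2, 3):
    undersampler = NearMiss(version=version, n_neighbors=3)
    X_res, y_res = undersampler.fit_resample(X_train, y_train)
    print(f"NearMiss-{version}:", Counter(y_res))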
For more detail, you can check this link.