We want the data we prepare and analyze for a model to be perfect. In practice, data may have missing values, outliers, or complex data types, so some preprocessing is always needed. That is normal. However, sometimes we encounter imbalanced data in classification tasks. Most algorithms do not handle this well, and the imbalance needs to be addressed.
For example, suppose you have a binary classification task with 100 records (rows), of which 90 rows are labeled 1 and the remaining 10 rows are labeled 0. With this data, our model will be biased: its predictions will be dominated by the majority class.
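The bias is easy to see with a quick calculation. The toy labels below follow the 90/10 split described above (hypothetical numbers): a model that always predicts the majority class scores 90% accuracy while never identifying a single minority record.

```python
# Hypothetical labels: 90 majority-class (1) and 10 minority-class (0) rows
y_true = [1] * 90 + [0] * 10

# A "model" that always returns the majority class
y_pred = [1] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.9 -- looks good, but every minority record is misclassified
```

This is why accuracy alone is misleading on imbalanced data.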
To prevent this, we can turn to the Imbalanced-learn library.
Imbalanced-learn (imported as imblearn) is an open-source, MIT-licensed library that relies on scikit-learn (imported as sklearn) and provides tools for classification with imbalanced classes.
The Imbalanced-learn library includes several methods for handling imbalanced data, grouped mainly into: under-sampling, over-sampling, combinations of over- and under-sampling, and ensembles of samplers. You can pick the relevant method or algorithm and apply it to your data.
Each category contains many techniques and algorithms. You can use boosting or bagging algorithms in the ensemble methods, or create a pipeline that combines SMOTE with under-sampling. The right choice depends on your data, analysis, and approach; see the documentation for details.
To give you an idea, we will apply random resampling techniques: the naive over-sampling and under-sampling methods, which are the most common imblearn implementations. These are RandomOverSampler and RandomUnderSampler.
What do these methods do?
We can define them as follows:
- RandomOverSampler duplicates rows of the minority class.
- RandomUnderSampler deletes rows of the majority class.
Both methods duplicate or delete rows at random. Naive resampling methods are the best choice when we want balanced data quickly and easily.
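In essence, the two samplers behave like the following pure-Python sketch (a conceptual illustration with toy rows, not the imblearn implementation):

```python
import random

random.seed(42)  # for reproducibility

majority = [(x, 1) for x in range(90)]  # 90 majority-class rows
minority = [(x, 0) for x in range(10)]  # 10 minority-class rows

# Random over-sampling: duplicate randomly chosen minority rows
# until the minority matches the majority (sampling_strategy=1)
oversampled = majority + random.choices(minority, k=len(majority))

# Random under-sampling: keep a random subset of majority rows
# the same size as the minority
undersampled = random.sample(majority, k=len(minority)) + minority

print(len(oversampled))   # 180 -> 90 rows per class
print(len(undersampled))  # 20  -> 10 rows per class
```

Over-sampling grows the dataset to twice the majority count; under-sampling shrinks it to twice the minority count.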
We should note that we apply resampling only to the training data. We balance only the training set; the test data keeps its original distribution. In other words, we apply resampling methods after splitting the data into training and test sets.
We will apply our resampling methods to the travel insurance dataset. We have a binary classification problem whose target feature is “Claim”: 0 is the majority class and 1 is the minority class. The target distribution is shown below.
We will apply Logistic Regression to compare the results on the imbalanced data and on the resampled data. This dataset comes from Kaggle and is known for being formidably imbalanced and for yielding low scores. We skip the exploratory data analysis step to make the comparison clearer, so the scores may be low.
We applied Logistic Regression to the data before any resampling. According to the classification report, the precision, recall, and f1-score for class 1 are all 0 because the model did not learn the minority class at all: it predicted every record as 0, in favor of the majority class. The result is a model that always returns the majority class and ignores the minority class.
The f1-score is the most suitable metric for models built on imbalanced datasets, so we will use it for the comparison.
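As a reminder of why the f1-score is informative here, it balances precision and recall for the minority class. The snippet below computes it by hand from a hypothetical confusion matrix (the counts are illustrative, not from this dataset):

```python
# Hypothetical counts for the minority class (label 1):
# true positives, false positives, false negatives
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # 0.75 -- of predicted 1s, how many were right
recall = tp / (tp + fn)     # 0.6  -- of actual 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.6667
```

A majority-class-only model has tp = 0 for the minority class, which is why its f1-score is 0.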
Now we will apply, in order: RandomOverSampler, RandomUnderSampler, and a combination of over- and under-sampling.
RandomOverSampler adds duplicated rows to the data: it equalizes the target counts by growing the minority class. This can be beneficial or harmful depending on the quantities, since we generally do not want too much of our data to be artificial copies. If there is a significant difference between the majority and minority classes, apply this method carefully; we may want to adjust the sampling_strategy parameter.
We set sampling_strategy to 1, which means the minority class will be brought to the same amount (1 to 1) as the majority class by duplicating its rows. Check y_smote’s value_counts (y_train is converted to y_smote by the resampling method).
We split our data into training and test sets and apply RandomOverSampler only to the training data (X_train and y_train). If we resampled the test data, or all of the data, it could cause data leakage.
from imblearn.over_sampling import RandomOverSampler

over = RandomOverSampler(sampling_strategy=1)
X_smote, y_smote = over.fit_resample(X_train, y_train)
After Logistic Regression, the metrics are shown above. The score increases by 9.52% after RandomOverSampler.
RandomUnderSampler randomly deletes rows of the majority class according to our sampling strategy. Keep in mind that this method deletes actual data; we usually do not want to lose or shrink our data.
We again set sampling_strategy to 1, which means the majority class will be reduced to the same amount (1 to 1) as the minority class by dropping rows. Check y_smote’s value_counts (y_train is converted to y_smote by the resampling method).
We split our data into training and testing and apply RandomUnderSampler only to training data (X_train and y_train).
from imblearn.under_sampling import RandomUnderSampler

under = RandomUnderSampler(sampling_strategy=1)
X_smote, y_smote = under.fit_resample(X_train, y_train)
After Logistic Regression, the metrics are shown above. The score increases by 9.37% after RandomUnderSampler.
A common way to use these resampling methods is to combine them in a pipeline. Applying only one of them is not recommended for large datasets with a significant difference between the majority and minority classes. As an extra option, we can apply over-sampling and under-sampling together in a pipeline, adjusting the sampling strategy of each step.
We create a pipeline with imblearn.pipeline and give it the steps (over, then under) in order. RandomOverSampler with a 0.1 sampling strategy raises the minority class to 0.1 * majority class. Next, RandomUnderSampler with a 0.5 sampling strategy reduces the majority class to 2 * minority class. At the end of the pipeline, the minority-to-majority ratio will be 0.5.
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

over = RandomOverSampler(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
X_smote, y_smote = pipeline.fit_resample(X_train, y_train)
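The arithmetic behind these two sampling strategies can be checked with hypothetical class counts (the numbers below are illustrative, not from the insurance dataset):

```python
majority, minority = 9000, 450

# Step 1: RandomOverSampler with sampling_strategy=0.1
# raises the minority to 0.1 * majority
minority = max(minority, int(0.1 * majority))  # 900

# Step 2: RandomUnderSampler with sampling_strategy=0.5
# cuts the majority down to minority / 0.5 = 2 * minority
majority = int(minority / 0.5)  # 1800

print(minority, majority, minority / majority)  # 900 1800 0.5
```

The dataset ends up far more balanced than the original 450:9000 split, without fully duplicating the minority class or discarding most of the majority class.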
After Logistic Regression, the metrics are shown above. After the pipeline, the score increased by 11.83%, the best result of the three approaches.
Remember: we apply these methods only to the training data. We balance only the training set; the test data keeps its original distribution.
The imblearn library contains other techniques and algorithms as well; check its documentation when solving imbalanced-dataset problems.
We should use these techniques carefully because they manipulate data.