Feature Selection with BorutaPy, RFE and Univariate Feature Selection
Feature selection is one of the most critical stages of a machine learning pipeline. Datasets often contain too many features, or features that carry no useful signal, so some of them have to be eliminated. Why does this matter? Selecting features well can improve the performance and quality of our model and shorten its training time. Noisy (non-informative) features contribute nothing to the model and can cause overfitting, so there is no reason to train on them. Feature selection can be the key to a successful machine learning process.
Feature selection is a technique for choosing the features in our data that contribute most to the target variable. In other words, we pick the best predictors of the target so that we can predict it with higher accuracy and, preferably, lower variance.
Although the terms feature selection and feature extraction are sometimes used interchangeably, they are two separate techniques. Feature selection returns a subset of the original features. It performs no conversion or transformation; it only removes unnecessary features according to given constraints. Feature extraction, on the other hand, transforms the features into new ones, each a combination of the originals, so it does not tell us which original features matter more than others. Feature importance is another related term. It often appears as a sub-stage within feature selection methods, where features are sorted by their importance, i.e., their contribution to the model's output.
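To make the difference between selection and extraction concrete, here is a minimal sketch. It assumes synthetic data in which only two columns drive the target, and it uses PCA as one example of an extraction method (PCA is our choice for illustration, not something this article covers).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): only columns 0 and 2 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 4.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Feature selection: the output columns are unchanged copies of original columns.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
kept = selector.get_support(indices=True)
X_selected = selector.transform(X)

# Feature extraction: each output column is a linear combination of all inputs.
pca = PCA(n_components=2).fit(X)
X_extracted = pca.transform(X)

print("selected column indices:", kept)
print("PCA loadings (rows = new features, columns = original features):")
print(pca.components_.round(2))
```

The selected matrix can be traced back to named original columns, while each PCA component mixes all four inputs.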
In this article we will focus on feature selection; feature extraction and feature importance will be the topics of another article.
Scikit-learn, one of the most popular Python libraries for data science, offers several useful methods for feature selection. The Boruta library also provides a handy, scikit-learn-compatible API for the Boruta feature selection algorithm. We will mainly focus on these techniques.
Feature selection techniques will be applied to the diamonds dataset from Seaborn. For the sake of simplicity, we remove the categorical features before splitting the data. Price is the target variable.
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

# Load the diamonds dataset and drop the categorical columns
df = sns.load_dataset('diamonds')
df = df.drop(['cut', 'color', 'clarity'], axis=1)

# Price is the target; the remaining columns are the features
X = df.drop('price', axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
All the methods below have a statistical and mathematical background that could be explored in-depth, yet we will just give a simple introduction.
1 — BorutaPy
BorutaPy is an improved Python implementation of the Boruta R package, and we will use it from the Boruta library. It is a feature selection algorithm built on NumPy, SciPy, and scikit-learn.
We can use BorutaPy just like any other scikit-learn estimator: fit, fit_transform and transform are all implemented in the usual way. BorutaPy needs an estimator that exposes feature importances, and an ensemble method can be used: random forest, extra trees, even gradient boosted trees.
BorutaPy generates shadow features (shuffled copies of all features) from the original data and derives a threshold from the importance the estimator assigns to those shadow features. A real feature counts as useful only if it scores above that threshold.
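The shadow-feature idea can be sketched by hand for a single round. This is a simplified illustration on assumed toy data, not BorutaPy's actual implementation: the library repeats the comparison over many iterations and applies a statistical test to the accumulated results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data (an assumption for illustration): only the first two columns affect y.
rng = np.random.default_rng(42)
n_samples = 400
X = rng.normal(size=(n_samples, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.2, size=n_samples)

# Shadow features: shuffle each column independently, which breaks any link to y.
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_extended = np.hstack([X, X_shadow])

forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_extended, y)

importances = forest.feature_importances_
real_importance = importances[: X.shape[1]]
shadow_importance = importances[X.shape[1]:]

# A real feature scores a "hit" when it beats the best shadow feature.
threshold = shadow_importance.max()
hits = real_importance > threshold

print("threshold from shadows:", round(float(threshold), 4))
print("hits per real feature:", hits)
```

Because the shadows are pure noise by construction, their best importance is a data-driven estimate of how much importance a useless feature can reach by chance.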
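import numpy as np
from sklearn.ensemble import RandomForestRegressor
from boruta import BorutaPy

# Base estimator that exposes feature_importances_
forest = RandomForestRegressor(max_depth=5)
feat_selector = BorutaPy(forest, n_estimators='auto', verbose=2, random_state=1)

# BorutaPy works on NumPy arrays, so convert the DataFrame and Series first
feat_selector.fit(np.array(X_train), np.array(y_train))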
After fitting, BorutaPy provides a ranking of the features. Confirmed features are ranked 1, tentative ones 2, and rejected ones 3 or higher, based on their feature importance history across the iterations.
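As a small sketch of how such a ranking can be read, the snippet below maps ranks to labels. The ranking array and the feature names are made up for illustration and are not results from the diamonds data; with a fitted selector the array would come from its `ranking_` attribute, and any rank above 2 is treated here as rejected.

```python
import numpy as np

# Hypothetical ranking, shaped like the ranking_ attribute of a fitted selector.
ranking = np.array([1, 1, 2, 3, 1, 4])
feature_names = ["f0", "f1", "f2", "f3", "f4", "f5"]  # placeholder names

# Rank 1 -> confirmed, rank 2 -> tentative, anything else -> rejected.
labels = np.where(
    ranking == 1,
    "confirmed",
    np.where(ranking == 2, "tentative", "rejected"),
)

for name, rank, label in zip(feature_names, ranking, labels):
    print(f"{name}: rank {rank} ({label})")
```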
One important point: multicollinearity should be removed before running BorutaPy.
2 — Recursive Feature Elimination (RFE)
RFE eliminates features using an estimator that supplies feature importances. For example, if the estimator is linear regression, RFE uses the coefficients of the linear model; if the estimator is a random forest, RFE uses the random forest's feature importances, and so on.
RFE ranks the features by the weights assigned by the external estimator (a supervised learning algorithm) and keeps the number of features the user asks for.
There are other related selectors in sklearn.feature_selection:
- RFECV: Recursive feature elimination with a built-in cross-validated selection of the best number of features.
- SelectFromModel: Feature selection based on thresholds of importance weights.
- SequentialFeatureSelector: Sequential cross-validation based feature selection. Does not rely on importance weights.
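A brief sketch of how these three selectors might be called is given below. The data is synthetic (three informative columns out of six, an assumption for illustration), and the parameter values such as `cv=5` and `threshold="mean"` are arbitrary choices rather than recommendations.

```python
import numpy as np
from sklearn.feature_selection import RFECV, SelectFromModel, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: only the first three columns carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
weights = np.array([5.0, 3.0, 2.0, 0.0, 0.0, 0.0])
y = X @ weights + rng.normal(scale=0.5, size=300)

# RFECV: chooses the number of features by cross-validated score.
rfecv = RFECV(estimator=LinearRegression(), step=1, cv=5).fit(X, y)

# SelectFromModel: keeps features whose importance exceeds a threshold
# (here, the mean absolute coefficient).
sfm = SelectFromModel(LinearRegression(), threshold="mean").fit(X, y)

# SequentialFeatureSelector: greedily adds features based on CV score,
# without looking at coefficients or importances.
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
).fit(X, y)

print("RFECV kept", rfecv.n_features_, "features:", rfecv.support_)
print("SelectFromModel mask:", sfm.get_support())
print("SequentialFeatureSelector mask:", sfs.get_support())
```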
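RFECV and SelectFromModel draw on the same source as RFE: the weights the estimator assigns to the features (feature importance). SequentialFeatureSelector instead scores candidate feature subsets by cross-validation. Each of them just needs its own extra parameters. We will apply RFE to the diamonds dataset.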
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

model = LinearRegression()
rfe = RFE(estimator=model, n_features_to_select=5)
fit = rfe.fit(X_train, y_train)
RFE recursively refits the model and eliminates the weakest feature at each step (one feature at a time, unless the user specifies otherwise) until the number of features falls to the preset limit. We set n_features_to_select to 5, so it returned 5 of the 6 initial features.
We can check the RFE result against the coefficients of the linear regression model (which serve as feature importance here). The ‘z’ feature ranks last, so RFE has removed it.
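The elimination loop described in this section can be sketched by hand and compared with scikit-learn's RFE. The data is synthetic with known coefficients (an assumption for illustration), and `manual_rfe` is a hypothetical helper written for this sketch, not part of any library.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data with known, decreasing coefficients.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6))
true_coefs = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.0])
y = X @ true_coefs + rng.normal(scale=0.1, size=300)


def manual_rfe(X, y, n_features_to_select):
    """Refit, drop the feature with the smallest |coefficient|, repeat."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        model = LinearRegression().fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(model.coef_)))
        del remaining[weakest]
    return remaining


manual_kept = manual_rfe(X, y, n_features_to_select=3)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1).fit(X, y)
sklearn_kept = np.flatnonzero(rfe.support_).tolist()

print("manual loop kept:", manual_kept)
print("sklearn RFE kept:", sklearn_kept)
print("sklearn ranking:", rfe.ranking_)
```

Note that coefficient magnitudes are only comparable when the features are on similar scales, which holds for this synthetic data but may need scaling on real datasets.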
3 — Univariate Feature Selection
Univariate feature selection works by selecting the best features based on univariate statistical tests, scoring each feature individually against the target.
There are different univariate feature selection methods in sklearn. We will focus on the most commonly used one, SelectKBest; the other methods in sklearn.feature_selection are:
- SelectPercentile: Select features based on the percentile of the highest scores.
- SelectFpr: Select features based on a false positive rate test.
- SelectFdr: Select features based on an estimated false discovery rate.
- SelectFwe: Select features based on family-wise error rate.
- GenericUnivariateSelect: Univariate feature selector with configurable mode.
There is also another method: removing features with low variance. It is not widely used, but it may come in handy when we need a step in our pipeline that quickly removes constant features.
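In scikit-learn, low-variance removal is available as VarianceThreshold. The sketch below uses a tiny hand-made array (an assumption for illustration) whose middle column is constant.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Tiny example array: the middle column is constant.
X = np.array([
    [1.0, 7.0, 0.2],
    [2.0, 7.0, 0.4],
    [3.0, 7.0, 0.1],
    [4.0, 7.0, 0.9],
])

# threshold=0.0 removes only features whose variance is exactly zero.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

print("kept columns:", selector.get_support())
print("shape after removal:", X_reduced.shape)
```

Unlike the other selectors in this article, this one never looks at the target, so it can run before any train/test considerations about leakage from y.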
We use the SelectKBest method, but we also need a score function; each of the methods above needs one.
- For regression: f_regression, mutual_info_regression
- For classification: chi2, f_classif, mutual_info_classif
These functions rely on different tests: f_regression uses univariate linear regression tests, f_classif uses the ANOVA F-value, and chi2 uses chi-squared statistics. The mutual_info_regression and mutual_info_classif functions are based on entropy estimation from k-nearest neighbors distances.
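The score functions can also be called directly, which helps when comparing them. The sketch below uses synthetic data (an assumption for illustration) and additionally recomputes the f_regression score from the Pearson correlation r as F = r² / (1 − r²) · (n − 2), which is what the univariate linear regression test amounts to with the default settings.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

# Synthetic data: only the first column is related to y.
rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

# f_regression returns one F statistic and one p-value per feature.
f_scores, p_values = f_regression(X, y)

# mutual_info_regression returns one non-negative score per feature.
mi_scores = mutual_info_regression(X, y, random_state=0)

# Recompute the F statistic from the Pearson correlation of each feature with y.
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
f_manual = r**2 / (1 - r**2) * (n - 2)

print("F scores:      ", f_scores.round(2))
print("F from r:      ", f_manual.round(2))
print("mutual info:   ", mi_scores.round(3))
```

The F test only captures linear association, while mutual information can also pick up nonlinear dependence at the cost of being noisier and slower to estimate.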
SelectKBest takes another parameter, k, besides the score function. It scores each feature with the score function and keeps the k highest-scoring features.
from sklearn.feature_selection import SelectKBest, f_regression

# Price is continuous, so we use the regression score function
selector = SelectKBest(score_func=f_regression, k=5)
X_new = selector.fit_transform(X_train, y_train)

# Pair the selected feature names with their F-scores
names = X_train.columns.values[selector.get_support()]
scores = selector.scores_[selector.get_support()]
names_scores = list(zip(names, scores))
ns_df = pd.DataFrame(data=names_scores, columns=['Feat_names', 'F_Scores'])
ns_df.sort_values('F_Scores', ascending=False)
SelectKBest labels each feature True or False. In the example above we set k to 5, so the features tagged True are the best 5 attributes (those with the strongest relationship to the output). The ‘depth’ feature is labeled False because we have 6 features and ‘depth’ ranks last in terms of the association measured by the chosen test.
Feature selection methods select features according to our own decisions, expressed through parameters such as the number of features or a threshold. This is the main difference between feature selection and feature importance.
BorutaPy is a more robust solution for feature selection because it does not require us to choose the number of features or a threshold. For the other methods, we have to supply one or the other.
All methods have different properties, strengths, and weaknesses for different types of datasets. There is no one-size-fits-all feature selection method, so we must choose the one that suits our data.