10 Python One-Liners for Feature Selection Like a Pro
In many data analysis processes, including machine learning, data preprocessing is an important stage before further analysis or model training and evaluation. As part of preprocessing, feature selection plays a critical role in improving the quality of the analysis or the model's performance by focusing on the data that is truly relevant and useful for the problem at hand. The right feature selection procedure can vary depending on the nature and complexity of the dataset being handled.
This article takes a practical tour through 10 Python one-liners: single lines of code that accomplish meaningful tasks efficiently and concisely. Specifically, it introduces 10 common and handy one-liners to keep in your notebook for performing feature selection on a variety of datasets.
Before starting, we need to import some essential Python libraries and modules. In particular, to illustrate the examples below, we will use two datasets freely available in scikit-learn's datasets module: the wine dataset and the breast cancer dataset.
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine, load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sns
To exemplify the 10 one-liners below, we will assume the datasets have been loaded into Pandas dataframes, as follows:
wine_data = pd.DataFrame(load_wine().data, columns=load_wine().feature_names)
bc_data = pd.DataFrame(load_breast_cancer().data, columns=load_breast_cancer().feature_names)
1. Selection Based on Variance Threshold
Feature selection based on variance is a very common approach aimed at removing features with low variability, since such near-constant features are likely to be uninformative for machine learning models.
This one-liner applies column-based filtering to the Pandas dataframe wine_data containing the wine dataset, keeping only the features (columns) whose variance is greater than 0.8.
X_selected = wine_data.loc[:, wine_data.var() > 0.8]
As a result, only 6 out of the original 13 features in this dataset are kept, with the rest deemed less informative and near-constant. The filtered data itself and the number of remaining features can be checked by executing X_selected.head() and X_selected.shape, respectively.
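Before fixing a threshold like 0.8, it can help to inspect the per-feature variances directly; here is a quick exploratory check (not part of the one-liner itself):

# Inspect feature variances, largest first, to pick a sensible threshold
print(wine_data.var().sort_values(ascending=False))

Keep in mind that variance is scale-dependent, so features measured on larger scales will naturally dominate this criterion unless the data is standardized first.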
2. Correlation-Based Feature Selection
This one-liner is illustrated with the breast cancer dataset: a labeled dataset that contains 30 numerical features plus a binary target variable (the breast cancer diagnosis, which can be positive or negative), accessible by using load_breast_cancer().target. Correlation-based feature selection is possible thanks to the corrwith() method, and it consists of selecting the features that have a certain degree of correlation with the target variable, for instance, keeping those features with an absolute correlation greater than 0.5 with the target labels.
X_selected = bc_data.loc[:, bc_data.columns[abs(bc_data.corrwith(pd.Series(load_breast_cancer().target))) > 0.5].tolist()]
Note that the inner bc_data.columns[...].tolist() expression produces a Python list with the names of the selected features; passing that list to .loc then yields X_selected as a dataframe restricted to those columns.
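If you want to inspect the correlations themselves before settling on the 0.5 cutoff, here is a quick exploratory check (the target variable name is our own):

# Absolute correlation of each feature with the diagnosis label, strongest first
target = pd.Series(load_breast_cancer().target)
print(bc_data.corrwith(target).abs().sort_values(ascending=False).head(10))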
3. Select K Best Features with F-Test
The F-test is based on ANOVA, a statistical test used to check for statistically significant differences among the means of three or more independent groups. ANOVA's F-value can be used as follows to quantify the relationship between each feature and the target variable (which for the wine dataset is also accessible separately using load_wine().target).
X_best = SelectKBest(f_classif, k=5).fit_transform(wine_data, load_wine().target)
This example one-liner identifies the five features that have the strongest relationship with the class labels. Both this approach and the previous correlation-based one help train classification models that are better able to capture the relationships between the predictor features and the target variable on future unseen data.
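Since fit_transform() returns a NumPy array without column names, you may also want to know which features were kept. One way to do it is to fit the selector first and map its support mask back to the column names (the selector variable name is ours):

# Fit the selector, then map its boolean support mask back to feature names
selector = SelectKBest(f_classif, k=5).fit(wine_data, load_wine().target)
print(wine_data.columns[selector.get_support()].tolist())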
4. Select K Best Features with Mutual Information
Selecting features based on mutual information measures the statistical dependence between two variables; in our example, between each feature and the target. The selected features will, as a result, be highly informative of the target value for a given instance.
The procedure to apply this method is quite similar to the previous F-test method, as both require specifying the number K of “top” features to retain.
X_best = SelectKBest(mutual_info_classif, k=6).fit_transform(bc_data, load_breast_cancer().target)
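If you want to see the mutual information scores behind this selection rather than just the transformed array, one option is to call mutual_info_classif directly; note that its estimates involve some randomness, so a random_state keeps them reproducible (this snippet is an illustrative aside, not the one-liner itself):

# Mutual information of each feature with the target, highest first
mi_scores = pd.Series(
    mutual_info_classif(bc_data, load_breast_cancer().target, random_state=42),
    index=bc_data.columns,
)
print(mi_scores.sort_values(ascending=False).head(6))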
5. Feature Importance By Leveraging Random Forest
Did you know we can resort to a random forest ensemble, specifically its computed feature importances, and leverage this information to perform feature selection? This example illustrates how:
selected_features = wine_data.columns[np.argsort(RandomForestClassifier(random_state=42).fit(wine_data, load_wine().target).feature_importances_)[-7:]].tolist()
A lot seems to go on in this single line of code! Let’s break it down:
- The RandomForestClassifier(random_state=42).fit(wine_data, load_wine().target).feature_importances_ part initializes and trains a simple random forest classifier on the wine dataset's features and target, and returns a NumPy array of floating-point values: the computed importance weights of the dataset's features.
- This array of weights is passed as an argument to np.argsort(), which sorts the array indices so that their associated weights are arranged in ascending order. Adding the [-7:] slicing operator at the end selects the indices associated with the 7 highest weights, which lets us identify the 7 most important features and retrieve them as a list of feature names.
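If the one-liner feels dense, the same logic can be unpacked into a few steps; this is just a readability-oriented rewrite of the line above, not a different method:

# Train the forest, sort features by importance, and keep the top 7
rf = RandomForestClassifier(random_state=42).fit(wine_data, load_wine().target)
importances = rf.feature_importances_   # one weight per feature
top_idx = np.argsort(importances)[-7:]  # indices of the 7 largest weights
selected_features = wine_data.columns[top_idx].tolist()
print(selected_features)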
6. Select Top Features via Recursive Feature Elimination and Logistic Regression
Recursive Feature Elimination (RFE) is a technique that recursively eliminates features, guided in this case by a logistic regression model that is refitted at each step. It removes the least important features until a specified number of them (for example, eight) remain: all in a single line, as shown below.
X_rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8).fit_transform(StandardScaler().fit_transform(bc_data), load_breast_cancer().target)
The process entails scaling the features for better results, and fitting an RFE object that takes two arguments: the logistic regression model and the desired number of features to keep.
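As with SelectKBest, the transformed output is a plain array, so if you also need the names of the eight retained features, one option (the rfe variable name is our own) is to fit the RFE object first and read its support_ mask:

# Fit RFE on the scaled data, then map the support mask back to column names
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8)
rfe.fit(StandardScaler().fit_transform(bc_data), load_breast_cancer().target)
print(bc_data.columns[rfe.support_].tolist())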
7. Principal Component Analysis for Feature Selection
Principal Component Analysis (PCA) is a well-known dimensionality reduction technique that selects enough principal components to explain most (typically around 90%, as in the example one-liner below) of the variance in the original data. To do this, it applies an eigendecomposition of the covariance matrix of the dataset's features and projects the original data onto the resulting principal components. Strictly speaking, PCA builds new features (the components) rather than selecting a subset of the original ones, but it serves the same goal of reducing the input dimensionality. Using PCA this way also requires feature scaling for best results.
X_pca = PCA(n_components=0.9, random_state=42).fit_transform(StandardScaler().fit_transform(wine_data))
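To verify how much variance the retained components actually explain, you can keep a reference to the fitted PCA object and check its explained_variance_ratio_ attribute (the pca variable name is ours):

# Fit PCA once, transform the scaled data, and inspect the explained variance
pca = PCA(n_components=0.9, random_state=42)
X_pca = pca.fit_transform(StandardScaler().fit_transform(wine_data))
print(pca.n_components_, pca.explained_variance_ratio_.sum())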
8. Feature Selection Based on Missing Values
A common-sense criterion for feature selection when the data is pervaded by missing values is to keep only those features that have mostly non-missing values, say at least 90% of their values present.
This one-liner makes use of Pandas' dropna() method at the column level to do this. Here's how:
selected_cols = wine_data.dropna(thresh=len(wine_data)*0.9, axis=1).columns.tolist()
Note that scikit-learn's built-in wine dataset contains no missing values, so this one-liner will keep every column as-is; for it to actually filter something out, you would need a version of the wine dataset that does contain missing values.
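As a quick illustration, here is a small sketch that artificially blanks out some values in a copy of the data so the one-liner has something to drop; the injected NaNs, the wine_missing name, and the chosen column are our own for demonstration purposes:

# Blank out ~20% of one column, then apply the same 90% threshold
wine_missing = wine_data.copy()
wine_missing.loc[wine_missing.sample(frac=0.2, random_state=42).index, "alcohol"] = np.nan
kept = wine_missing.dropna(thresh=int(len(wine_missing) * 0.9), axis=1).columns.tolist()
print("alcohol" in kept)  # False: fewer than 90% of its values are present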
9. L1-Based Feature Selection
L1 regularization can be combined with logistic regression models to automatically perform feature selection. The one-liner below applies the following steps (an unpacked version is shown after the code):
- Standardize the data before regularization and train a logistic regression model on the wine dataset, with an L1 (lasso) weight penalty and a regularization strength setting of C=0.5.
- The fitted coefficient matrix has one row per class (the wine dataset has three classes); np.any(... != 0, axis=0) builds a boolean mask flagging the features whose coefficient is non-zero for at least one class.
- Based on this mask, only the features with non-zero coefficients are selected and returned as a list of column names.
selected_features = wine_data.columns[np.any(LogisticRegression(penalty='l1', solver='liblinear', C=0.5).fit(StandardScaler().fit_transform(wine_data), load_wine().target).coef_ != 0, axis=0)].tolist()
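Unpacked, the selection logic looks like this; the model and mask names are our own, and the behavior matches the one-liner above:

# Fit an L1-penalized logistic regression on scaled data, then keep features
# whose coefficient is non-zero for at least one of the three wine classes
model = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
model.fit(StandardScaler().fit_transform(wine_data), load_wine().target)
mask = np.any(model.coef_ != 0, axis=0)  # coef_ has shape (3, 13) -> mask of length 13
print(wine_data.columns[mask].tolist())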
10. Removing Multicollinear Features
Multicollinearity refers to strong correlation or linear dependence among the predictor variables in a dataset. This final one-liner reduces multicollinearity in the breast cancer dataset by eliminating redundant features: it checks each feature against all the others and keeps only those whose absolute correlation with every other variable is at most 0.85. The check is applied to every column via a list comprehension.
keep_cols = [col for col in bc_data.columns if not any(bc_data.drop(columns=col).corrwith(bc_data[col]).abs() > 0.85)]
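Note that this strict criterion can drop both members of a highly correlated pair. A commonly used variant, shown below as an alternative sketch rather than the article's one-liner, keeps one feature from each pair by scanning only the upper triangle of the correlation matrix (variable names are our own):

# Drop only one feature from each highly correlated pair by inspecting
# the upper triangle of the absolute correlation matrix
corr = bc_data.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.85).any()]
X_reduced = bc_data.drop(columns=to_drop)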
Conclusion
This article took a quick tour through ten effective Python one-liners that, once you are familiar with them, will turbocharge your process of selecting relevant features from your data for further analysis or machine learning modeling tasks.