Feature Selection: A Simplified Guide

Published on July 07, 2025

Feature selection is the process of identifying the subset of input variables that contributes most to a machine learning model's predictive accuracy, while reducing dimensionality and computational cost. It is a critical step in preparing data for modeling: the goal is to retain only the most informative features and eliminate those that are irrelevant or redundant. It applies to both supervised and unsupervised settings, since reducing dimensionality also benefits clustering, visualization, and other unsupervised tasks.

Figure: There are four key reasons to apply feature selection: (1) Improved predictive performance by removing noise and irrelevant features; (2) Reduced training time due to lower computational load; (3) Dimensionality control, mitigating overfitting in high-dimensional spaces; and (4) Increased interpretability, making models easier to explain and validate.

This guide starts by identifying the most common types of features to eliminate before moving on to more sophisticated filtering and selection techniques.

Identifying Features to Remove

Before applying advanced selection algorithms, it is essential to remove features that, by construction or statistical nature, do not provide informative value to the model. These features fall into two main categories:

Figure: Two key types of non-informative features to discard before applying selection algorithms: (1) Irrelevant features have no predictive value because they bear no relationship to the target (e.g., IDs, constant columns). (2) Redundant features carry information already captured by other variables, often due to high correlation or dependency.

Tip: The presence of such features tends to obscure real patterns in the data, can lead to unstable models (especially in linear algorithms), and generally increases complexity without real performance gains.

Constant or Near-Constant Features

Features that show the same (or almost the same) value across all samples have zero or negligible variance. Since they do not vary, they do not help distinguish between instances and should be eliminated upfront.
Additionally, features with a high proportion of missing values can often be removed at this stage, as they provide little information for modeling.
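As a rough illustration of the missing-value check, the sketch below drops columns whose share of missing values exceeds a cutoff. The helper name drop_mostly_missing, the 50% cutoff, and the toy columns are assumptions made for this example, not values prescribed by this guide.

```python
import numpy as np
import pandas as pd


def drop_mostly_missing(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Return df without columns whose fraction of missing values exceeds max_missing."""
    missing_ratio = df.isna().mean()  # per-column fraction of NaN values
    keep = missing_ratio[missing_ratio <= max_missing].index
    return df[keep]


# Toy data (hypothetical): "fax_number" is 75% missing, "income" is 25% missing.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40_000, np.nan, 52_000, 61_000],
    "fax_number": [np.nan, np.nan, np.nan, "555-0100"],
})

reduced = drop_mostly_missing(df, max_missing=0.5)
print(list(reduced.columns))  # ['age', 'income']
```

The right cutoff depends on the dataset: a sparsely populated column can still be informative if its missingness itself carries signal, so this is best treated as a first pass.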

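Practical tool: sklearn.feature_selection.VarianceThreshold removes features whose variance falls below a threshold. Values between 1e-3 and 1e-2 are typical, but variance depends on each feature's scale, so the threshold should be chosen with the units of the data in mind and applied before standardization (after which every non-constant feature has variance 1).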
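A minimal sketch of VarianceThreshold on synthetic data follows. The three columns and the 0.01 threshold are assumptions chosen so that one column is constant, one is near-constant, and one varies freely.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 200

constant = np.ones(n)                 # variance is exactly 0
near_constant = np.zeros(n)
near_constant[0] = 1.0                # a single differing value: variance is about 0.005
informative = rng.normal(size=n)      # variance is about 1

X = np.column_stack([constant, near_constant, informative])

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

kept = selector.get_support()         # boolean mask over the original columns
print(kept)                           # [False False  True]
print(X_reduced.shape)                # (200, 1)
```

The fitted selector exposes the per-feature variances in selector.variances_, which can help when picking a threshold for a real dataset.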

Redundant Features by Correlation or Dependency

Redundancy can arise whenever two or more features behave very similarly or when one feature can be determined by others. The most common causes include:

High correlation: Two features with a Pearson correlation coefficient |ρ| > 0.9 essentially provide the same information. For categorical variables, metrics such as Cramér’s V can be used to assess association.
Multicollinearity: Occurs when a feature can be explained as a linear combination of others, measured by the Variance Inflation Factor (VIF). VIF values above 5 (or 10, for more permissive criteria) suggest unacceptable collinearity.
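To make the VIF criterion concrete: for feature j, VIF_j = 1 / (1 - R_j²), where R_j² is the coefficient of determination obtained by regressing feature j on all the other features. The sketch below computes it with plain NumPy least squares. The function name vif and the synthetic columns x1 to x4 are assumptions for illustration, and libraries such as statsmodels offer their own implementation.

```python
import numpy as np


def vif(X: np.ndarray) -> np.ndarray:
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X."""
    n_samples, n_features = X.shape
    scores = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        # Regress column j on all other columns plus an intercept.
        design = np.column_stack([np.ones(n_samples), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        # 1 / (1 - R^2) simplifies to ss_tot / ss_res.
        scores[j] = ss_tot / ss_res if ss_res > 0 else np.inf
    return scores


rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)  # almost a linear combination of x1 and x2
x4 = rng.normal(size=500)                        # unrelated to the others
X = np.column_stack([x1, x2, x3, x4])

scores = vif(X)
for name, score in zip(["x1", "x2", "x3", "x4"], scores):
    print(f"{name}: VIF = {score:.1f}")
```

In this setup x1, x2, and x3 should all land far above the 5 and 10 cutoffs, because each of them is nearly determined by the other two, while x4 should stay close to 1.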

Recommended strategies: Identify highly correlated pairs and remove one from each pair, or use an iterative VIF approach, eliminating features with high values until all are below the defined threshold.
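The pairwise strategy can be sketched as a greedy pass over the columns: keep a feature only if it is not highly correlated with any feature already kept. The helper name drop_correlated, the column names, and the toy data are assumptions for this example, and the sketch presumes constant columns were removed beforehand (their correlation is undefined).

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Greedily keep columns whose |Pearson r| with every kept column is <= threshold."""
    corr = df.corr(method="pearson").abs()
    kept = []
    for col in df.columns:
        # Compare only against columns that survived so far.
        if all(corr.loc[col, other] <= threshold for other in kept):
            kept.append(col)
    return df[kept]


rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=100)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,            # same information in different units
    "weight_kg": rng.normal(70, 12, size=100),
})

reduced = drop_correlated(df, threshold=0.9)
print(list(reduced.columns))  # ['height_cm', 'weight_kg']
```

Which member of a pair survives depends on column order here; in practice domain knowledge or correlation with the target can decide. The iterative VIF variant follows the same greedy shape: recompute the VIF values, drop the feature with the largest one, and repeat until all values fall below the chosen threshold.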

Feature Selection Approaches

Feature selection can be performed using different strategies, grouped into three main categories: filters, wrappers, and embedded methods. Each has advantages and disadvantages depending on the problem context, project goals, and available resources.

Filter Methods	Wrapper Methods	Embedded Methods
General: Evaluate each feature independently using statistical properties (variance, correlation, independence tests). Selection occurs before model training, without involving the ML algorithm. These methods are model-independent (do not use any ML model for selection).	General: Use a specific ML model to evaluate subsets of features by training and validation. The achieved performance guides the selection process.	General: Feature selection is integrated into model training. Some algorithms assign weights or importances, allowing automatic elimination of less relevant features.
Examples: Variance; Pearson correlation; Mutual information; Chi-squared test (χ²); F-score (ANOVA); Relief	Examples: Recursive Feature Elimination (RFE); Sequential selection (forward/backward); Genetic algorithms	Examples: LASSO (L1); Ridge (L2, which shrinks coefficients without zeroing them); Elastic Net; Random Forest; Gradient Boosting (XGBoost, LightGBM)
Pros: Computationally efficient; Model-agnostic; Easy to interpret	Pros: Consider feature interactions; Can maximize model performance	Pros: More efficient than wrappers; Capture complex/non-linear relationships; Selection and training in one step
Cons: Ignore feature interactions; May retain redundancies; Do not directly optimize model performance	Cons: High computational cost; Risk of overfitting without proper cross-validation	Cons: Coupled to the chosen model; Interpretation can be complex with correlated features

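Python tools: The sklearn.feature_selection module provides filter methods (e.g., SelectKBest with f_classif, chi2, or mutual_info_classif), wrapper methods (RFE, SequentialFeatureSelector), and SelectFromModel for embedded selection.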
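To compare the three families side by side, the sketch below applies one representative of each to the same synthetic regression problem. The dataset parameters, the choice of k = 5, and the Lasso alpha are assumptions made for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 20 features, of which only 5 actually drive the target.
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Filter: rank features by a univariate F-test and keep the top 5.
filter_sel = SelectKBest(score_func=f_regression, k=5).fit(X, y)

# Wrapper: recursively refit a model, dropping the weakest feature each round.
wrapper_sel = RFE(estimator=LinearRegression(), n_features_to_select=5).fit(X, y)

# Embedded: the L1 penalty zeroes out coefficients during training;
# SelectFromModel keeps the features with non-negligible coefficients.
embedded_sel = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)

selected = {
    "filter": filter_sel.get_support(indices=True),
    "wrapper": wrapper_sel.get_support(indices=True),
    "embedded": embedded_sel.get_support(indices=True),
}
for name, indices in selected.items():
    print(f"{name:>8}: {indices.tolist()}")
```

The filter and wrapper return exactly the requested number of features, while the embedded selector returns however many coefficients the penalty leaves non-zero, so its output size is controlled indirectly through alpha.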

Feature Selection Pipeline

Feature selection is best structured as a clear, iterative pipeline. In real-world projects, it is common to revisit and adjust steps based on validation results, iterating until satisfactory performance and interpretability are achieved. The diagram below summarizes the recommended flow:

Figure: Systematic Feature Selection Pipeline. Each step feeds into the next, with feedback for iterative refinement.

Data Preprocessing: Handle missing values, encode categoricals, normalize/standardize.
Remove Irrelevant/Redundant: Drop constant, highly missing, or highly correlated features.
Filter-Based Selection: Rank features by univariate metrics (e.g., mutual info, correlation, chi²).
Wrapper/Embedded: Use model-based selection (e.g., RFE, LASSO, Random Forest importances).
Cross-Validation: Evaluate model performance with selected features.
Interpretation & Adjustment: Review selected features, adjust pipeline if needed.
Note: Document all decisions for reproducibility and future reference.
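The steps above can be sketched as a single scikit-learn Pipeline, so that every selection step is refit inside each cross-validation fold and never sees the held-out data. The synthetic dataset, the step names, and the values of k and n_features_to_select are assumptions for illustration. The constant-feature check is placed before scaling because raw variance is only meaningful on unscaled data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=400, n_features=30, n_informative=6, n_redundant=4, random_state=0
)

pipeline = Pipeline([
    # Data preprocessing and removal of constant columns.
    ("impute", SimpleImputer(strategy="median")),
    ("drop_constant", VarianceThreshold(threshold=0.0)),
    ("scale", StandardScaler()),
    # Filter-based selection: cheap univariate ranking, keep the top 15.
    ("filter", SelectKBest(score_func=f_classif, k=15)),
    # Wrapper selection on the reduced set: keep 8 features.
    ("wrapper", RFE(LogisticRegression(max_iter=1000), n_features_to_select=8)),
    # Final model trained on the selected features.
    ("model", LogisticRegression(max_iter=1000)),
])

# Cross-validation: the whole pipeline, selection included, is refit per fold.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Running the selectors on the full dataset before cross-validating would leak information from the validation folds into the choice of features, which tends to inflate the reported score. Wrapping them in the pipeline avoids that.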

Final Considerations

There is no universal "best method": The choice depends on data type, dataset size, ML model to be used, computational resources, and objectives (performance vs. interpretability).
Iteration: The process can be iterative. Start with filters, then apply wrappers/embedded methods on the reduced set. Cross-validation is essential at every decision point that could lead to overfitting.
Domain knowledge: Expert knowledge about the problem can often guide the selection or removal of certain features, as well as the definition of thresholds.
Impact of redundancy: Remember that redundant features not only increase computational complexity but can give undue importance to certain aspects of the data and destabilize some models.
Documentation: Documenting the process and the rationale for each decision is crucial for reproducibility and interpretability of the results obtained.
