Overview
Problem Formulation
Feature selection is the problem of identifying a subset of relevant features for model training. In high-dimensional settings, many features may be redundant or irrelevant, leading to:
Increased computational cost
Reduced model interpretability
Risk of overfitting to noise
Difficulty in feature engineering for domain experts
The feature selection problem can be formulated as follows. Given a dataset \(\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}\) where \(\mathbf{x}_i \in \mathbb{R}^d\) and \(y_i\) is a target, we seek to learn both a selection mechanism and a predictor by solving

\[\min_{\mathbf{f}, \mathbf{g}} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(\mathbf{f}(\mathbf{g}(\mathbf{x}_i) \odot \mathbf{x}_i), y_i\big) + \lambda \, \Omega(\mathbf{g})\]

where:
\(\mathbf{g}(\mathbf{x}) \in \{0, 1\}^d\) is a discrete feature selector
\(\mathbf{f}\) is the prediction model
\(\mathcal{L}\) is the task loss (e.g., classification loss)
\(\Omega(\mathbf{g})\) encourages sparsity
\(\lambda\) balances accuracy vs. sparsity
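To make the objective concrete, below is a minimal PyTorch sketch of the penalized loss for one mini-batch. The names (`penalized_loss`, `model`, `gates`, `gates.regularization()`) are illustrative placeholders, not the library's API.

```python
import torch.nn.functional as F

def penalized_loss(model, gates, x, y, lam):
    """Sketch of L(f(g(x) ⊙ x), y) + λ · Ω(g) for one mini-batch."""
    x_gated = gates(x)                       # selector applies (relaxed) gates to the features
    logits = model(x_gated)                  # predictor f on the gated input
    task_loss = F.cross_entropy(logits, y)   # task loss L (classification)
    sparsity = gates.regularization()        # sparsity penalty Ω(g)
    return task_loss + lam * sparsity        # λ balances accuracy vs. sparsity
```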
Stochastic Gating Approach
The challenge is that directly optimizing discrete gates \(\mathbf{g}\) is intractable. Stochastic gating methods replace the discrete variables with continuous relaxations: continuous variables \(\mathbf{z}\) are passed through a smooth mapping that approximates binary gates, enabling gradient-based optimization while maintaining the sparsity-inducing property.
Methods Overview
Stochastic Gates (STG)
Based on Yamada et al. (2020), STG uses Gaussian relaxation to approximate Bernoulli variables.
Forward pass:
Sample from Gaussian: \(z_d = \mu_d + \sigma \epsilon_d\), with \(\epsilon_d \sim \mathcal{N}(0, 1)\)
Apply hard clipping: \(\tilde{z}_d = \text{clamp}(z_d, [0, 1])\)
Gate features: \(\tilde{\mathbf{x}} = \tilde{\mathbf{z}} \odot \mathbf{x}\)
At inference: Use deterministic gates \(\tilde{z}_d = \text{clamp}(\mu_d, [0, 1])\)
Regularization: Encourages sparse selection via the expected number of open gates,

\[\Omega(\mathbf{g}) = \sum_{d} \Phi\!\left(\frac{\mu_d}{\sigma}\right)\]

where \(\Phi\) is the cumulative distribution function of the standard normal distribution.
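As a minimal sketch (not the library's exact implementation), an STG-style gate layer can be written in PyTorch as follows, assuming a fixed noise scale \(\sigma\) and gate means initialized at 0.5:

```python
import torch
import torch.nn as nn

class StochasticGates(nn.Module):
    """Sketch of a Gaussian-relaxed (STG-style) gate layer."""

    def __init__(self, d, sigma=0.5):
        super().__init__()
        self.mu = nn.Parameter(0.5 * torch.ones(d))  # gate means μ_d, one per feature
        self.sigma = sigma                           # fixed noise scale σ

    def forward(self, x):
        if self.training:
            eps = torch.randn_like(x)                # ε ~ N(0, 1), per sample and feature
            z = self.mu + self.sigma * eps           # z_d = μ_d + σ ε_d
        else:
            z = self.mu                              # deterministic gates at inference
        z_tilde = torch.clamp(z, 0.0, 1.0)           # hard clipping to [0, 1]
        return x * z_tilde                           # gate the features

    def regularization(self):
        # Expected number of open gates: Σ_d Φ(μ_d / σ), with Φ the standard normal CDF.
        normal = torch.distributions.Normal(0.0, 1.0)
        return normal.cdf(self.mu / self.sigma).sum()
```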
Straight-Through Estimator (STE)
Based on Bengio et al. (2013), STE uses binary gates with gradient approximation.
Forward pass:
Compute probabilities: \(p_d = \sigma(\text{logit}_d)\)
Hard binarization: \(g_d = \begin{cases} 1 & p_d > 0.5 \\ 0 & \text{otherwise} \end{cases}\)
Gate features: \(\tilde{\mathbf{x}} = \mathbf{g} \odot \mathbf{x}\)
Gradient approximation: The straight-through estimator treats the binarization as the identity function in the backward pass (\(\partial g_d / \partial p_d \approx 1\)), allowing gradients to flow through to the logits.
Regularization: A sparsity penalty on the gate probabilities, e.g., \(\Omega = \sum_{d} p_d\) (the expected number of active gates).
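A corresponding sketch of straight-through gates: the `g_hard + p - p.detach()` trick uses the hard gates in the forward pass while gradients follow the sigmoid probabilities, and the sum-of-probabilities regularizer is an assumed instantiation of the sparsity penalty described above.

```python
import torch
import torch.nn as nn

class STEGates(nn.Module):
    """Sketch of straight-through (STE) binary gates."""

    def __init__(self, d):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(d))   # one logit per feature

    def forward(self, x):
        p = torch.sigmoid(self.logits)               # p_d = σ(logit_d)
        g_hard = (p > 0.5).float()                   # hard binarization (no gradient)
        g = g_hard + p - p.detach()                  # straight-through estimator
        return x * g                                 # gate the features

    def regularization(self):
        # Expected number of active gates: Σ_d p_d.
        return torch.sigmoid(self.logits).sum()
```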
Gumbel-Softmax
Based on Jang et al. (2017), this method uses a relaxation of a categorical distribution.
Key idea: Treat feature selection as categorical choice between {off, on} states.
Forward pass (training):
Apply Gumbel-Softmax with temperature \(\tau\)
Use hard=True for discrete samples during the forward pass
Temperature annealing: \(\tau \to 0\) during training
Forward pass (inference): Select argmax state
Regularization: Penalizes the probability of the on state for each feature, e.g., \(\Omega = \sum_{d} p_d^{\text{on}}\).
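A sketch of per-feature Gumbel-Softmax gates over the two states {off, on}, using `torch.nn.functional.gumbel_softmax`; the on-state-probability regularizer is an assumption following the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelGates(nn.Module):
    """Sketch of Gumbel-Softmax gates with states [off, on] per feature."""

    def __init__(self, d, tau=1.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(d, 2))  # per-feature logits for [off, on]
        self.tau = tau                                 # temperature τ, annealed toward 0

    def forward(self, x):
        if self.training:
            # hard=True gives discrete one-hot samples in the forward pass,
            # with gradients taken through the soft relaxation.
            sample = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)
            g = sample[:, 1]                               # "on" component of each sample
        else:
            g = (self.logits.argmax(dim=-1) == 1).float()  # argmax state at inference
        return x * g

    def regularization(self):
        # Probability of the "on" state, summed over features.
        return F.softmax(self.logits, dim=-1)[:, 1].sum()
```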
L1 Regularization
Baseline method using an L1 penalty on feature weights.
Model: Learn weights \(\mathbf{w} \in \mathbb{R}^d\) directly and rescale the input features, \(\tilde{\mathbf{x}} = \mathbf{w} \odot \mathbf{x}\).
Regularization: \(\Omega(\mathbf{w}) = \|\mathbf{w}\|_1 = \sum_{d} |w_d|\)
Interpretation: Weights \(\mathbf{w}\) indirectly indicate feature importance.
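A sketch of the L1 baseline as a simple feature-weighting layer (names are illustrative):

```python
import torch
import torch.nn as nn

class L1Weights(nn.Module):
    """Sketch of the L1 baseline: continuous feature weights with an L1 penalty."""

    def __init__(self, d):
        super().__init__()
        self.w = nn.Parameter(torch.ones(d))   # one weight per feature

    def forward(self, x):
        return x * self.w                      # rescale features by the learned weights

    def regularization(self):
        return self.w.abs().sum()              # Ω(w) = ||w||_1
```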
Training Strategy
Joint Optimization
The model and the feature selector are optimized jointly, minimizing the objective above over both the predictor \(\mathbf{f}\) and the gates \(\mathbf{g}\).
Two-optimizer approach:
Optimizer 1 (model): lower learning rate (e.g., \(\eta_m = 0.001\))
Optimizer 2 (selector): higher learning rate (e.g., \(\eta_s = 0.01\))
The higher selector learning rate allows the gates to adapt quickly to the task, as in the sketch below.
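This is a minimal sketch of the two-optimizer loop, reusing the hypothetical `penalized_loss` helper from the problem-formulation sketch; the optimizer choice (Adam) is an assumption, only the learning rates come from the text above.

```python
import torch

def fit(model, gates, train_loader, lam, epochs=100):
    """Sketch of joint training with separate optimizers for predictor and gates."""
    opt_model = torch.optim.Adam(model.parameters(), lr=1e-3)  # η_m = 0.001
    opt_gates = torch.optim.Adam(gates.parameters(), lr=1e-2)  # η_s = 0.01
    for _ in range(epochs):
        for x, y in train_loader:
            loss = penalized_loss(model, gates, x, y, lam)
            opt_model.zero_grad()
            opt_gates.zero_grad()
            loss.backward()
            opt_model.step()   # slower updates to the predictor
            opt_gates.step()   # faster updates so the gates adapt quickly
```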
Early Stopping
Training uses validation-based early stopping (sketched after this list):
Monitor validation accuracy
Save best model state when validation metric improves
Stop if no improvement for patience epochs (default: 50)
Requires at least 100 epochs before stopping
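A sketch of this early-stopping logic under the defaults above; `train_one_epoch` and `validate` are assumed callables that run one training epoch and return the validation accuracy, respectively.

```python
import copy

def train_with_early_stopping(model, gates, train_one_epoch, validate,
                              max_epochs=1000, patience=50, min_epochs=100):
    """Sketch of validation-based early stopping with a minimum epoch count."""
    best_acc = -float("inf")
    best_state = None
    since_best = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        acc = validate()                       # monitor validation accuracy
        if acc > best_acc:                     # save best state on improvement
            best_acc, since_best = acc, 0
            best_state = (copy.deepcopy(model.state_dict()),
                          copy.deepcopy(gates.state_dict()))
        else:
            since_best += 1
        # Stop only after at least `min_epochs` epochs and `patience` epochs
        # without improvement.
        if epoch >= min_epochs and since_best >= patience:
            break
    return best_state, best_acc
```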
Lambda Selection
The sparsity parameter \(\lambda\) controls the trade-off between accuracy and feature count. SToG includes a grid search over candidate \(\lambda\) values, selecting the one that achieves the target sparsity while maintaining high accuracy; a sketch of such a search follows.
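In this sketch, `train_and_eval(lam)` is a hypothetical helper that trains with the given \(\lambda\) and returns validation accuracy and the fraction of selected features; the selection rule shown (best accuracy among runs near a target sparsity) is an assumption rather than SToG's exact criterion.

```python
def grid_search_lambda(train_and_eval, lambdas=(0.001, 0.01, 0.1, 1.0),
                       target_sparsity=0.1, tol=0.05):
    """Sketch of a λ grid search balancing sparsity and accuracy."""
    best_lam, best_acc = None, -float("inf")
    for lam in lambdas:
        acc, frac_selected = train_and_eval(lam)
        # Among runs close to the target fraction of selected features,
        # keep the λ with the highest validation accuracy.
        if abs(frac_selected - target_sparsity) <= tol and acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam
```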
References
For detailed references, see References.