Overview

Problem Formulation

Feature selection is the problem of identifying a subset of relevant features for model training. In high-dimensional settings, many features may be redundant or irrelevant, leading to:

  • Increased computational cost

  • Reduced model interpretability

  • Risk of overfitting to noise

  • Difficulty in feature engineering for domain experts

The feature selection problem can be formulated as follows. Given a dataset \(\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}\) where \(\mathbf{x}_i \in \mathbb{R}^d\) and \(y_i\) is a target, we seek to learn both a selection mechanism and a predictor model:

\[\min_{\mathbf{g}, \mathbf{f}} \mathcal{L}(\mathbf{g}, \mathbf{f}) + \lambda \Omega(\mathbf{g})\]

where:

  • \(\mathbf{g}(\mathbf{x}) \in \{0, 1\}^d\) is a discrete feature selector

  • \(\mathbf{f}\) is the prediction model

  • \(\mathcal{L}\) is the task loss (e.g., classification loss)

  • \(\Omega(\mathbf{g})\) encourages sparsity

  • \(\lambda\) balances accuracy vs. sparsity

Stochastic Gating Approach

The challenge is that directly optimizing the discrete gates \(\mathbf{g}\) is intractable: the selector is non-differentiable and the search over feature subsets is combinatorial. Stochastic gating methods replace the discrete variables with continuous relaxations:

\[z_d = \mu_d + \sigma \cdot \epsilon_d, \quad \epsilon_d \sim \mathcal{N}(0, 1)\]

The continuous variables \(\mathbf{z}\) are passed through a smooth mapping to approximate binary gates, enabling gradient-based optimization while maintaining the sparsity-inducing property.
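As a minimal sketch of this relaxation (assuming PyTorch and the clamp-based mapping used by STG below; the names relaxed_gates, mu, and sigma are illustrative, not the library's API):

```python
import torch

def relaxed_gates(mu: torch.Tensor, sigma: float = 0.5) -> torch.Tensor:
    """Sample z_d = mu_d + sigma * eps_d and squash it into [0, 1]."""
    eps = torch.randn_like(mu)        # eps_d ~ N(0, 1), one draw per feature
    z = mu + sigma * eps              # reparameterization: gradients flow into mu
    return torch.clamp(z, 0.0, 1.0)   # smooth mapping to approximately binary gates

# Gate a batch of inputs (shape [batch, d]) with one learnable gate per feature.
mu = torch.zeros(10, requires_grad=True)
x = torch.randn(4, 10)
x_gated = relaxed_gates(mu) * x       # gated features feed the downstream predictor
```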

Methods Overview

Stochastic Gates (STG)

Based on Yamada et al. (2020), STG uses Gaussian relaxation to approximate Bernoulli variables.

Forward pass:

  • Sample from the Gaussian: \(z_d = \mu_d + \sigma \epsilon_d\)

  • Apply hard clipping: \(\tilde{z}_d = \text{clamp}(z_d, [0, 1])\)

  • Gate the features: \(\tilde{\mathbf{x}} = \tilde{\mathbf{z}} \odot \mathbf{x}\)

At inference: Use deterministic gates \(\tilde{z}_d = \text{clamp}(\mu_d, [0, 1])\)

Regularization: Encourages sparse selection via:

\[\Omega(\mu, \sigma) = \sum_{d=1}^{D} P(z_d > 0) = \sum_{d=1}^{D} \Phi\left(\frac{\mu_d}{\sigma}\right)\]

where \(\Phi\) is the standard normal cumulative distribution function.
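A minimal sketch of these equations in PyTorch (the class name STGGate and the default sigma are illustrative assumptions, not the library's API):

```python
import torch
import torch.nn as nn

class STGGate(nn.Module):
    """Illustrative stochastic-gate layer following the STG equations above."""

    def __init__(self, n_features: int, sigma: float = 0.5):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_features))
        self.sigma = sigma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            z = self.mu + self.sigma * torch.randn_like(self.mu)  # z_d = mu_d + sigma * eps_d
        else:
            z = self.mu                                           # deterministic gates at inference
        return torch.clamp(z, 0.0, 1.0) * x                       # gate the features

    def regularization(self) -> torch.Tensor:
        # Omega = sum_d Phi(mu_d / sigma): the expected number of open gates
        normal = torch.distributions.Normal(0.0, 1.0)
        return normal.cdf(self.mu / self.sigma).sum()
```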

Straight-Through Estimator (STE)

Based on Bengio et al. (2013), STE uses binary gates with gradient approximation.

Forward pass:

  • Compute probabilities: \(p_d = \sigma(\text{logit}_d)\)

  • Hard binarization: \(g_d = \begin{cases} 1 & p_d > 0.5 \\ 0 & \text{otherwise} \end{cases}\)

  • Gate the features: \(\tilde{\mathbf{x}} = \mathbf{g} \odot \mathbf{x}\)

Gradient approximation: Straight-through allows backpropagation through the binarization:

\[\frac{\partial g_d}{\partial \text{logit}_d} \approx \frac{\partial p_d}{\partial \text{logit}_d}\]

Regularization:

\[\Omega(p) = \sum_{d=1}^{D} p_d\]
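A minimal sketch of the straight-through gate (assuming PyTorch; the class name STEGate is an illustrative assumption):

```python
import torch
import torch.nn as nn

class STEGate(nn.Module):
    """Illustrative straight-through gate following the equations above."""

    def __init__(self, n_features: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = torch.sigmoid(self.logits)      # p_d = sigmoid(logit_d)
        hard = (p > 0.5).float()            # hard binarization in the forward pass
        # Straight-through trick: the forward value is `hard`, while the backward
        # pass uses d p_d / d logit_d as the surrogate gradient.
        g = hard + p - p.detach()
        return g * x

    def regularization(self) -> torch.Tensor:
        # Omega = sum_d p_d penalizes the expected number of active gates
        return torch.sigmoid(self.logits).sum()
```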

Gumbel-Softmax

Based on Jang et al. (2017), Gumbel-Softmax relaxes a categorical distribution over gate states.

Key idea: Treat feature selection as categorical choice between {off, on} states.

Forward pass (training):

  • Apply the Gumbel-Softmax relaxation with temperature \(\tau\)

  • Use hard=True to obtain discrete samples in the forward pass

  • Anneal the temperature \(\tau \to 0\) during training

Forward pass (inference): Select argmax state

Regularization: Probability of on state:

\[\Omega = \sum_{d=1}^{D} p(z_{d,\text{on}} = 1 | \text{logits}_d)\]
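A minimal sketch of a per-feature {off, on} Gumbel-Softmax gate (assuming PyTorch; the class name GumbelGate and the default temperature are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelGate(nn.Module):
    """Illustrative Gumbel-Softmax gate with {off, on} states per feature."""

    def __init__(self, n_features: int, tau: float = 1.0):
        super().__init__()
        # Two logits per feature: column 0 = "off", column 1 = "on".
        self.logits = nn.Parameter(torch.zeros(n_features, 2))
        self.tau = tau  # anneal toward 0 during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # hard=True returns one-hot samples; gradients use the soft relaxation.
            sample = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)
            gate = sample[:, 1]                        # "on" entry of the one-hot sample
        else:
            gate = self.logits.argmax(dim=-1).float()  # select the argmax state at inference
        return gate * x

    def regularization(self) -> torch.Tensor:
        # Omega = sum_d p(on): softmax probability of the "on" state per feature
        return F.softmax(self.logits, dim=-1)[:, 1].sum()
```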

Correlated STG

Extension of STG for datasets with correlated features.

Motivation: When features are highly correlated, selecting them independently can lead to selecting all of the correlated copies. CorrelatedSTG uses group regularization to account for this.

Key addition: A regularization term informed by the feature correlation structure \(\mathbf{C}\):

\[\Omega_{\text{corr}}(\mu, \mathbf{C}) = \sum_{d=1}^{D} \Phi\left(\frac{\mu_d}{\sigma}\right) + \alpha \sum_{d,d'} C_{dd'} \left(\Phi\left(\frac{\mu_d}{\sigma}\right) - \Phi\left(\frac{\mu_{d'}}{\sigma}\right)\right)^2\]
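A minimal sketch of \(\Omega_{\text{corr}}\) (assuming PyTorch; the function name and default values of sigma and alpha are illustrative assumptions):

```python
import torch

def correlated_stg_penalty(mu: torch.Tensor, C: torch.Tensor,
                           sigma: float = 0.5, alpha: float = 1.0) -> torch.Tensor:
    """Illustrative implementation of Omega_corr; C is a d x d correlation matrix."""
    normal = torch.distributions.Normal(0.0, 1.0)
    p_open = normal.cdf(mu / sigma)                    # Phi(mu_d / sigma) per feature
    sparsity = p_open.sum()                            # base STG sparsity term
    diff = p_open.unsqueeze(1) - p_open.unsqueeze(0)   # pairwise gate-probability gaps
    correlation_term = (C * diff.pow(2)).sum()         # weight the gaps by C_{dd'}
    return sparsity + alpha * correlation_term
```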

L1 Regularization

Baseline method using L1 penalty on feature weights.

Model: Learn weights \(\mathbf{w} \in \mathbb{R}^d\) directly:

\[\tilde{\mathbf{x}} = \mathbf{w} \odot \mathbf{x}\]

Regularization:

\[\Omega(w) = \sum_{d=1}^{D} |w_d|\]

Interpretation: The magnitudes of the learned weights \(\mathbf{w}\) serve as an indirect measure of feature importance; the L1 penalty drives the weights of uninformative features toward zero.
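A minimal sketch of this baseline (assuming PyTorch; the class name L1Gate is an illustrative assumption):

```python
import torch
import torch.nn as nn

class L1Gate(nn.Module):
    """Illustrative L1-penalized feature weighting."""

    def __init__(self, n_features: int):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w * x                 # x_tilde = w ⊙ x

    def regularization(self) -> torch.Tensor:
        return self.w.abs().sum()         # Omega = sum_d |w_d|
```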

Training Strategy

Joint Optimization

The model and feature selector are optimized jointly:

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}}(\mathbf{f}, \mathbf{g}) + \lambda \Omega(\mathbf{g})\]

Two-optimizer approach:

  • Optimizer 1 (model): lower learning rate (e.g., \(\eta_m = 0.001\))

  • Optimizer 2 (selector): higher learning rate (e.g., \(\eta_s = 0.01\))

The higher selector learning rate allows the gates to adapt quickly to the task.
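A minimal joint-training sketch, reusing the STGGate class sketched above; the network sizes, optimizer choices, learning rates, and \(\lambda\) value are illustrative assumptions rather than the library's defaults:

```python
import torch
import torch.nn as nn

d, lam = 20, 0.1
gate = STGGate(d)                                    # any of the gate modules sketched above
model = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

opt_model = torch.optim.Adam(model.parameters(), lr=1e-3)   # predictor: lower learning rate
opt_gate = torch.optim.Adam(gate.parameters(), lr=1e-2)     # selector: higher learning rate

x, y = torch.randn(64, d), torch.randint(0, 2, (64,))       # toy batch
for _ in range(10):
    opt_model.zero_grad(); opt_gate.zero_grad()
    # L_total = L_task(f, g) + lambda * Omega(g)
    loss = loss_fn(model(gate(x)), y) + lam * gate.regularization()
    loss.backward()                                  # gradients reach both predictor and gates
    opt_model.step(); opt_gate.step()
```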

Early Stopping

Training uses validation-based early stopping:

  1. Monitor validation accuracy

  2. Save best model state when validation metric improves

  3. Stop if no improvement for patience epochs (default: 50)

  4. Require at least 100 epochs of training before stopping is allowed
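A minimal sketch of this loop (assuming PyTorch; the function name and the caller-supplied train_one_epoch and validate callables are illustrative assumptions):

```python
import copy
import torch.nn as nn

def fit_with_early_stopping(model: nn.Module, train_one_epoch, validate,
                            patience: int = 50, min_epochs: int = 100,
                            max_epochs: int = 1000) -> nn.Module:
    """Validation-based early stopping as described above."""
    best_acc, best_state, stale = 0.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        acc = validate()                                    # 1. monitor validation accuracy
        if acc > best_acc:
            best_acc, stale = acc, 0
            best_state = copy.deepcopy(model.state_dict())  # 2. save best model state
        else:
            stale += 1
        if epoch + 1 >= min_epochs and stale >= patience:   # 3. + 4. stop after patience,
            break                                           #    but not before min_epochs
    if best_state is not None:
        model.load_state_dict(best_state)                   # restore the best checkpoint
    return model
```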

Lambda Selection

The sparsity parameter \(\lambda\) controls the trade-off between accuracy and feature count. SToG includes a grid search for selecting \(\lambda\):

\[\text{score}(\lambda) = \text{accuracy} - 0.5 \cdot |\text{selected\_count} - \text{target\_count}|\]

This score balances reaching the target number of selected features against maintaining high accuracy.
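A minimal sketch of the grid search (the function name is an illustrative assumption, and train_and_eval(lam) is assumed to return the validation accuracy and the number of selected features for a given \(\lambda\)):

```python
def grid_search_lambda(train_and_eval, lambdas, target_count: int):
    """Pick the lambda maximizing score(lambda) = accuracy - 0.5 * |selected - target|."""
    best_lam, best_score = None, float("-inf")
    for lam in lambdas:
        accuracy, selected_count = train_and_eval(lam)            # retrain for this lambda
        score = accuracy - 0.5 * abs(selected_count - target_count)
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam

# Example usage: best = grid_search_lambda(train_and_eval, [0.01, 0.1, 1.0], target_count=10)
```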

References

For detailed references, see References.