========
Overview
========

Problem Formulation
===================

Feature selection is the problem of identifying a subset of relevant features
for model training. In high-dimensional settings, many features may be
redundant or irrelevant, leading to:

- Increased computational cost
- Reduced model interpretability
- Risk of overfitting to noise
- Difficulty in feature engineering for domain experts

The feature selection problem can be formulated as follows. Given a dataset
:math:`\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}` where :math:`\mathbf{x}_i \in \mathbb{R}^D`
and :math:`y_i` is a target, we seek to learn both a feature selector and a
prediction model:

.. math::

   \min_{\mathbf{g}, \mathbf{f}} \mathcal{L}(\mathbf{g}, \mathbf{f}) + \lambda \Omega(\mathbf{g})

where:

- :math:`\mathbf{g}(\mathbf{x}) \in \{0, 1\}^D` is a discrete feature selector
- :math:`\mathbf{f}` is the prediction model
- :math:`\mathcal{L}` is the task loss (e.g., classification loss)
- :math:`\Omega(\mathbf{g})` encourages sparsity
- :math:`\lambda` balances accuracy vs. sparsity

Stochastic Gating Approach
==========================

The challenge is that directly optimizing the discrete gates :math:`\mathbf{g}`
is intractable. Stochastic gating methods replace the discrete variables with
continuous relaxations:

.. math::

   z_d = \mu_d + \sigma \cdot \epsilon_d, \quad \epsilon_d \sim \mathcal{N}(0, 1)

The continuous variables :math:`\mathbf{z}` are passed through a smooth mapping
to approximate binary gates, enabling gradient-based optimization while
maintaining the sparsity-inducing property.

Methods Overview
================

Stochastic Gates (STG)
~~~~~~~~~~~~~~~~~~~~~~

Based on Yamada et al. (2020), STG uses a Gaussian relaxation to approximate
Bernoulli variables.

**Forward pass:**

- Sample from a Gaussian: :math:`z_d = \mu_d + \sigma \epsilon_d`
- Apply hard clipping: :math:`\tilde{z}_d = \text{clamp}(z_d, [0, 1])`
- Gate the features: :math:`\tilde{\mathbf{x}} = \tilde{\mathbf{z}} \odot \mathbf{x}`

**At inference:** Use deterministic gates :math:`\tilde{z}_d = \text{clamp}(\mu_d, [0, 1])`

**Regularization:** Encourages sparse selection via:

.. math::

   \Omega(\mu, \sigma) = \sum_{d=1}^{D} P(z_d > 0) = \sum_{d=1}^{D} \Phi\left(\frac{\mu_d}{\sigma}\right)

where :math:`\Phi` is the standard normal cumulative distribution function.

Straight-Through Estimator (STE)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on Bengio et al. (2013), STE uses binary gates with an approximate
gradient.

**Forward pass:**

- Compute probabilities: :math:`p_d = \sigma(\text{logit}_d)`, where
  :math:`\sigma(\cdot)` denotes the sigmoid function
- Hard binarization: :math:`g_d = \begin{cases} 1 & p_d > 0.5 \\ 0 & \text{otherwise} \end{cases}`
- Gate the features: :math:`\tilde{\mathbf{x}} = \mathbf{g} \odot \mathbf{x}`

**Gradient approximation:** The straight-through estimator allows
backpropagation through the binarization:

.. math::

   \frac{\partial g_d}{\partial \text{logit}_d} \approx \frac{\partial p_d}{\partial \text{logit}_d}

**Regularization:**

.. math::

   \Omega(p) = \sum_{d=1}^{D} p_d

Gumbel-Softmax
~~~~~~~~~~~~~~

Based on Jang et al. (2017), Gumbel-Softmax uses a relaxation of the
categorical distribution.

**Key idea:** Treat feature selection as a categorical choice between the
{off, on} states.

**Forward pass (training):**

- Apply Gumbel-Softmax with temperature :math:`\tau`
- Use ``hard=True`` for discrete samples during the forward pass
- Temperature annealing: :math:`\tau \to 0` during training

**Forward pass (inference):** Select the argmax state

**Regularization:** Probability of the "on" state:

.. math::

   \Omega = \sum_{d=1}^{D} p(z_{d,\text{on}} = 1 \mid \text{logits}_d)
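The STG and Gumbel-Softmax gating mechanisms described above can be sketched in
a few lines of plain PyTorch. The snippet below is a minimal illustration under
stated assumptions, not the SToG package's implementation: the class names, the
fixed ``sigma``, the initialization of ``mu``, and the two-state logit layout
are choices made for this sketch only.

.. code-block:: python

   import torch
   import torch.nn as nn
   import torch.nn.functional as F


   class GaussianGate(nn.Module):
       """STG-style gate: clamped Gaussian relaxation of Bernoulli variables."""

       def __init__(self, num_features: int, sigma: float = 0.5):
           super().__init__()
           # mu starts at 0.5 so gates begin half-open (illustrative choice)
           self.mu = nn.Parameter(0.5 * torch.ones(num_features))
           self.sigma = sigma

       def forward(self, x: torch.Tensor) -> torch.Tensor:
           if self.training:
               # z_d = mu_d + sigma * eps_d, then hard clipping to [0, 1]
               z = self.mu + self.sigma * torch.randn_like(self.mu)
           else:
               # deterministic gates at inference: clamp(mu_d, [0, 1])
               z = self.mu
           return x * torch.clamp(z, 0.0, 1.0)

       def regularizer(self) -> torch.Tensor:
           # sum_d P(z_d > 0) = sum_d Phi(mu_d / sigma)
           return torch.distributions.Normal(0.0, 1.0).cdf(self.mu / self.sigma).sum()


   class GumbelGate(nn.Module):
       """Gumbel-Softmax gate: per-feature categorical choice between off/on."""

       def __init__(self, num_features: int, tau: float = 1.0):
           super().__init__()
           self.logits = nn.Parameter(torch.zeros(num_features, 2))  # [off, on]
           self.tau = tau  # temperature; typically annealed toward 0 in training

       def forward(self, x: torch.Tensor) -> torch.Tensor:
           if self.training:
               # hard=True gives discrete samples with straight-through gradients
               sample = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)
               gates = sample[:, 1]  # "on" entry of the one-hot sample per feature
           else:
               # inference: select the argmax state per feature
               gates = (self.logits.argmax(dim=-1) == 1).float()
           return x * gates

       def regularizer(self) -> torch.Tensor:
           # sum_d p(z_{d,on} = 1 | logits_d)
           return F.softmax(self.logits, dim=-1)[:, 1].sum()

In both modules, ``regularizer()`` returns the sparsity penalty :math:`\Omega`
that is added to the task loss with weight :math:`\lambda`, as in the training
strategy described further below.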
Correlated STG
~~~~~~~~~~~~~~

An extension of STG for datasets with correlated features.

**Motivation:** When features are highly correlated, independent selection can
lead to selecting all of the correlated copies. CorrelatedSTG uses group
regularization.

**Key addition:** A feature correlation matrix :math:`\mathbf{C}` informs the
regularization:

.. math::

   \Omega_{\text{corr}}(\mu, \mathbf{C}) = \sum_{d=1}^{D} \Phi\left(\frac{\mu_d}{\sigma}\right) + \alpha \sum_{d,d'} C_{dd'} \left(\Phi\left(\frac{\mu_d}{\sigma}\right) - \Phi\left(\frac{\mu_{d'}}{\sigma}\right)\right)^2

L1 Regularization
~~~~~~~~~~~~~~~~~

A baseline method using an L1 penalty on feature weights.

**Model:** Learn weights :math:`\mathbf{w} \in \mathbb{R}^D` directly:

.. math::

   \tilde{\mathbf{x}} = \mathbf{w} \odot \mathbf{x}

**Regularization:**

.. math::

   \Omega(w) = \sum_{d=1}^{D} |w_d|

**Interpretation:** The weights :math:`\mathbf{w}` indirectly indicate feature
importance.

Training Strategy
=================

Joint Optimization
~~~~~~~~~~~~~~~~~~

The model and the feature selector are optimized jointly:

.. math::

   \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}}(\mathbf{f}, \mathbf{g}) + \lambda \Omega(\mathbf{g})

**Two-optimizer approach:**

- Optimizer 1 (model): lower learning rate (e.g., :math:`\eta_m = 0.001`)
- Optimizer 2 (selector): higher learning rate (e.g., :math:`\eta_s = 0.01`)

The higher selector learning rate allows the gates to adapt quickly to the
task; a minimal sketch of this two-optimizer loop appears at the end of this
page.

Early Stopping
~~~~~~~~~~~~~~

Training uses validation-based early stopping:

1. Monitor validation accuracy
2. Save the best model state when the validation metric improves
3. Stop if there is no improvement for *patience* epochs (default: 50)
4. Require at least 100 epochs before stopping

Lambda Selection
~~~~~~~~~~~~~~~~

The sparsity parameter :math:`\lambda` controls the trade-off between accuracy
and feature count. SToG includes a grid search for the optimal :math:`\lambda`:

.. math::

   \text{score}(\lambda) = \text{accuracy} - 0.5 \cdot |\text{selected\_count} - \text{target\_count}|

This balances achieving the target sparsity while maintaining high accuracy.

References
==========

For detailed references, see :doc:`references`.
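As a closing illustration of the joint optimization strategy above, the sketch
below runs the two-optimizer loop: a lower learning rate for the prediction
model, a higher one for the gate parameters, and the :math:`\lambda`-weighted
regularizer added to the task loss. The ``regularizer()`` method and all names
are carried over from the gate sketch earlier on this page and are assumptions
for illustration; this is not the SToG training API.

.. code-block:: python

   import torch
   import torch.nn as nn


   def train(model: nn.Module, gate: nn.Module, loader, lam: float = 0.1,
             epochs: int = 100) -> None:
       """Jointly optimize the predictor and the gates (illustrative sketch)."""
       # Two optimizers: lower learning rate for the model, higher for the gates.
       opt_model = torch.optim.Adam(model.parameters(), lr=1e-3)
       opt_gate = torch.optim.Adam(gate.parameters(), lr=1e-2)
       loss_fn = nn.CrossEntropyLoss()

       for _ in range(epochs):
           for x, y in loader:
               logits = model(gate(x))  # gate the features, then predict
               loss = loss_fn(logits, y) + lam * gate.regularizer()

               opt_model.zero_grad()
               opt_gate.zero_grad()
               loss.backward()
               opt_model.step()
               opt_gate.step()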