selectors

Feature Selection Methods

Feature selector implementations.

Overview

The SToG.selectors module implements five feature selection methods, each with different properties and use cases.
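
All five selectors share the BaseFeatureSelector interface (forward(x), regularization_loss(), get_selection_probs()), so they are interchangeable inside a training loop. The sketch below is a minimal illustration only; the downstream model, dummy data, and penalty weight lam are hypothetical choices, not part of SToG:

import torch
import torch.nn as nn
from SToG import STGLayer  # any of the five selectors works the same way

selector = STGLayer(input_dim=100, sigma=0.5)
model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 1))  # hypothetical downstream model
optimizer = torch.optim.Adam(list(selector.parameters()) + list(model.parameters()), lr=1e-3)
lam = 0.1  # sparsity penalty weight; tune per dataset

x, y = torch.randn(32, 100), torch.randn(32, 1)  # dummy batch
for step in range(100):
    optimizer.zero_grad()
    gated = selector(x)                                  # gate the input features
    loss = nn.functional.mse_loss(model(gated), y)
    loss = loss + lam * selector.regularization_loss()   # add the sparsity penalty
    loss.backward()
    optimizer.step()

print(selector.get_selection_probs())  # per-feature selection probabilities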

Stochastic Gates (STGLayer)

class SToG.selectors.STGLayer(input_dim: int, sigma: float = 0.5, device: str = 'cpu')[source]

Bases: BaseFeatureSelector

Stochastic Gates (STG) - Original implementation from Yamada et al. 2020. Uses Gaussian-based continuous relaxation of Bernoulli variables.

Reference: “Feature Selection using Stochastic Gates” (Yamada et al., ICML 2020)

__init__(input_dim: int, sigma: float = 0.5, device: str = 'cpu')[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) Tensor[source]

Apply stochastic gates to input features.

regularization_loss() Tensor[source]

Compute regularization: sum of probabilities of selection.

get_selection_probs() Tensor[source]

Get selection probabilities for each feature.

Method: Gaussian-based continuous relaxation (Yamada et al., 2020)

When to use:
  • Balanced accuracy and sparsity

  • Need smooth gradient flow

  • Stable training on most datasets

Parameters:
  • sigma - Standard deviation of the Gaussian noise injected into the gates (default: 0.5)

  • Larger sigma: more exploration, potentially less sparse solutions

  • Smaller sigma: more nearly deterministic gates, faster convergence

Example:

from SToG import STGLayer
selector = STGLayer(input_dim=100, sigma=0.5)
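
After training, a feature is typically kept when its selection probability exceeds a cutoff. The 0.5 cutoff below is a common post-hoc heuristic, not something the API prescribes:

probs = selector.get_selection_probs()
selected = (probs > 0.5).nonzero(as_tuple=True)[0]  # indices of retained features
print(f"kept {selected.numel()} of {probs.numel()} features")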

Straight-Through Estimator (STELayer)

class SToG.selectors.STELayer(input_dim: int, device: str = 'cpu')[source]

Bases: BaseFeatureSelector

Straight-Through Estimator for feature selection. Applies hard binary gates in the forward pass while letting gradients flow through a sigmoid relaxation in the backward pass.

Reference: “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation” (Bengio et al., 2013)

__init__(input_dim: int, device: str = 'cpu')[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) Tensor[source]

Apply straight-through gates to input features.

regularization_loss() Tensor[source]

Compute regularization: sum of selection probabilities.

get_selection_probs() Tensor[source]

Get selection probabilities for each feature.

Method: Binary gates with gradient approximation (Bengio et al., 2013)

When to use:
  • Need explicit binary decisions (on/off)

  • Prefer fast convergence

  • Working with small feature sets

Advantages:
  • Produces true binary gates at inference

  • Fast training convergence

  • Clear feature selection (no fuzzy boundaries)

Disadvantages:
  • The straight-through gradient estimate is biased

  • Can get stuck in local optima

  • May over-select features

Example:

from SToG import STELayer
selector = STELayer(input_dim=100)
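
The gradient approximation mentioned above is the standard straight-through trick: exact 0/1 gates in the forward pass, sigmoid gradients in the backward pass. The snippet below is a conceptual illustration in plain PyTorch, not SToG's internal code:

import torch

logits = torch.zeros(100, requires_grad=True)   # one gate logit per feature
probs = torch.sigmoid(logits)
hard = (probs > 0.5).float()                    # forward value: exact binary gates
gates = probs + (hard - probs).detach()         # backward gradient: that of the sigmoid
gated_x = torch.randn(32, 100) * gates          # gate a dummy batch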

Gumbel-Softmax (GumbelLayer)

class SToG.selectors.GumbelLayer(input_dim: int, temperature: float = 1.0, device: str = 'cpu')[source]

Bases: BaseFeatureSelector

Gumbel-Softmax based feature selector. Uses a categorical distribution over {off, on} for each feature.

Reference: “Categorical Reparameterization with Gumbel-Softmax” (Jang et al., ICLR 2017)

Note: This implementation properly handles the batch dimension and sampling.

__init__(input_dim: int, temperature: float = 1.0, device: str = 'cpu')[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) Tensor[source]

Apply Gumbel-Softmax gates to input features.

Parameters:

x – Input tensor of shape [batch_size, input_dim]

Returns:

Gated input tensor

regularization_loss() Tensor[source]

Compute regularization: sum of “on” state probabilities.

get_selection_probs() Tensor[source]

Get selection probabilities for each feature.

set_temperature(temperature: float)[source]

Update temperature for annealing schedule.

Method: Categorical distribution relaxation (Jang et al., 2017)

When to use:
  • Need principled probabilistic framework

  • Working with discrete latent variables

  • Can afford temperature annealing schedule

Parameters:
  • temperature - Initial temperature (default: 1.0)

  • Temperature annealing: \(\tau \to 0\) during training

  • Smaller temperature: more discrete behavior

Advantages:
  • Theoretically grounded in Gumbel distribution

  • Flexible temperature schedule

  • Good for categorical problems

Example:

from SToG import GumbelLayer
selector = GumbelLayer(input_dim=100, temperature=1.0)
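
set_temperature is typically called once per epoch to anneal \(\tau\) toward 0. A minimal sketch, assuming an exponential decay with a floor; the schedule itself is a user choice, not fixed by the API:

selector = GumbelLayer(input_dim=100, temperature=1.0)
for epoch in range(50):
    tau = max(0.1, 0.95 ** epoch)      # exponential decay, floored at 0.1
    selector.set_temperature(tau)
    # ... run one training epoch as in the Overview example ...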

Correlated STG (CorrelatedSTGLayer)

class SToG.selectors.CorrelatedSTGLayer(input_dim: int, sigma: float = 0.5, group_penalty: float = 0.1, device: str = 'cpu')[source]

Bases: BaseFeatureSelector

STG variant with explicit handling of correlated features. Uses a group structure over the inputs so that correlated features tend to be selected or dropped together.

Reference: “Adaptive Group Sparse Regularization for Deep Neural Networks”

__init__(input_dim: int, sigma: float = 0.5, group_penalty: float = 0.1, device: str = 'cpu')[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) Tensor[source]

Apply correlated stochastic gates to input features.

regularization_loss() Tensor[source]

Compute regularization with correlation penalty.

get_selection_probs() Tensor[source]

Get selection probabilities for each feature.

Method: STG variant for correlated features

When to use:
  • Features have high correlation

  • Want to avoid selecting all correlated copies

  • Need group-aware feature selection

How it works:
  • Computes feature correlation structure

  • Adds group regularization penalty

  • Encourages correlated groups to be selected together or not at all

Parameters:
  • sigma - Standard deviation of Gaussian noise (default: 0.5)

  • group_penalty - Weight of the group/correlation regularization term (default: 0.1)

Example:

from SToG import CorrelatedSTGLayer
selector = CorrelatedSTGLayer(
    input_dim=100,
    sigma=0.5,
    group_penalty=0.1
)
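
Because regularization_loss() folds the correlation penalty (weighted by group_penalty) into a single term, the selector drops into the same training loop shown in the Overview; only the constructor call changes. A minimal sketch with a dummy batch:

import torch
from SToG import CorrelatedSTGLayer

selector = CorrelatedSTGLayer(input_dim=100, sigma=0.5, group_penalty=0.1)
gated = selector(torch.randn(32, 100))        # group-aware gating
penalty = selector.regularization_loss()      # includes the correlation/group term
# total loss = task loss + lam * penalty, as in the Overview example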

L1 Regularization (L1Layer)

class SToG.selectors.L1Layer(input_dim: int, device: str = 'cpu')[source]

Bases: BaseFeatureSelector

L1 regularization on input layer weights. Baseline comparison method for feature selection.

__init__(input_dim: int, device: str = 'cpu')[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) Tensor[source]

Apply L1 weights to input features.

regularization_loss() Tensor[source]

Compute L1 regularization: sum of absolute weights.

get_selection_probs() Tensor[source]

Get feature importance (absolute weights).

get_selected_features(threshold: float = 0.1) ndarray[source]

Get selected features based on weight magnitude.

Method: Classical L1 penalty on feature weights

When to use:
  • Baseline comparison

  • Want interpretable feature weights

  • Need simple, proven method

How it works:
  • Learns feature weights \(w \in \mathbb{R}^d\)

  • Gates input: \(\tilde{x} = w \odot x\)

  • Encourages small weights via L1 penalty

Advantages:
  • Simple and interpretable

  • Fast convergence

  • Well-studied statistical properties

Disadvantages:
  • Soft selection (weights are continuous)

  • May not achieve exact sparsity

  • Features selected by magnitude, not binary gates

Example:

from SToG import L1Layer
selector = L1Layer(input_dim=100)
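
L1Layer is the only selector documented with a get_selected_features helper, which thresholds the learned weights by absolute magnitude. Sketch (training elided):

from SToG import L1Layer

selector = L1Layer(input_dim=100)
# ... train with selector.regularization_loss() added to the task loss, as in the Overview ...
selected = selector.get_selected_features(threshold=0.1)  # ndarray of features whose |weight| exceeds the threshold
print(selected)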

Method Comparison

Feature Selection Methods Comparison

Method          Convergence   Sparsity    Interpretability   Stability   Use Case
-------------   -----------   ---------   ----------------   ---------   -------------------
STG             Medium        Good        Good               High        General purpose
STE             Fast          Good        Excellent          Medium      Binary selection
Gumbel          Medium        Good        Good               Medium      Categorical
CorrelatedSTG   Slow          Excellent   Good               High        Correlated features
L1              Fast          Fair        Good               High        Baseline